Abstract



In real distributed systems, processes may have only inexact information about the 
amount of real time needed for primitive operations such as process steps. This thesis 
studies the effect of this timing uncertainty on the real-time behavior of distributed systems. 
We consider a semi-synchronous model in which the amount of real time between process 
steps is known to be in the interval $[c_1, c_2]$ and every message is known to be delivered within
time $d$ of when it is sent. We use $C = c_2/c_1$ as a measure of the timing uncertainty.

We first study the problem of reaching agreement in the presence of failures. A simple
argument derived from the case of synchronous processes shows that at least time $(f + 1)d$
is required to tolerate $f$ failures, while time $(f + 1)Cd$ is sufficient to tolerate $f$ stopping or
omission failures by directly simulating the rounds of any synchronous consensus algorithm.
We narrow this gap for omission failures, building on the nearly optimal algorithm of Attiya,
Dwork, Lynch, and Stockmeyer, which tolerates only stopping failures. If fewer than half the
processes are faulty ($n \ge 2f + 1$), then the running time of our algorithm is
$4(f + 1)d + Cd$, which is within a factor of 4 of optimal and has minimal dependency on the timing
uncertainty factor $C$. If more than half the processes are faulty, then a more complicated
analysis shows the running time is increased by approximately a factor of $\min(\frac{f}{n-f}, \sqrt{C})$. We
also present a general simulation for $n \ge 3f + 1$ tolerant of Byzantine failures that simulates
any synchronous algorithm at a cost of time $2Cd + d$ per round.

Finally, motivated by the message inefficiency of our consensus algorithm for omission 
failures, we define a more realistic model of message links by limiting their capacity. If 
messages are sent too frequently on these message links, they may incur delay greater 
than $d$. For message links with capacity $\mu$, we prove nearly tight upper and lower bounds of
$\min(2Cd + d,\ C^2 d/\mu + Cd + d)$ and $\min(2Cd + d/\mu,\ C^2 d/\mu + Cd + d)$, respectively, for the
time needed to detect stopping failures.

Chapter 1



Introduction 



In real distributed systems, processes are likely to be neither perfectly synchronous nor com- 
pletely asynchronous. Many systems lie somewhere between these two extremes and can thus 
be more accurately modeled by a semi-synchronous model in which processes have inexact 
knowledge about real time. In our model, the degree of asynchrony is captured by a parameter
which we call the processes' timing uncertainty. We will be particularly interested in how
the magnitude of timing uncertainty affects the time complexity of distributed computing
problems. In particular, we study the time needed to reach consensus in the presence
of omission failures and in the presence of Byzantine failures. We also introduce a model
of message links with bounded capacity and study the time needed to detect failures in a
system using these message links. 

In a synchronous system, processors have perfectly synchronized clocks and distributed 
algorithms are often broken up into rounds of communication. In a single round of com- 
munication, each processor may receive messages from other processors, perform some local 
computation, and then send messages to other processors. The time required to perform 
local operations is generally assumed to be negligible and the time complexity of algorithms 
is therefore measured by the number of rounds of communication required. In an asyn- 
chronous system, the delay of messages is arbitrary and unbounded (or the relative rates 
of different processors are unbounded). The time complexity of an asynchronous algorithm 
is usually measured by letting one time unit equal the maximum delay of any message 
([Gal82, Awe85]). 

The model we use is a slightly simplified version of the semi-synchronous model intro- 
duced in [AL89], which is in turn based on the formal model of timed automata in [MMT90]. 
In this model, processors have inexact knowledge about the time needed to perform certain 
primitive operations. The model is formally described in Section 2.1, but is very simple: 
every message is delivered within time $d$ of when it is sent and the amount of time between
any two consecutive steps of any process is in the interval $[c_1, c_2]$. Because process steps are
the only events for which there is a lower bound, a process can deduce a lower bound on
the amount of time for any interval of events only by counting the number of steps it takes
in that interval. For instance, to ensure that time $d$ elapses over an interval of events, a
processor must count $d/c_1$ of its local steps; after these steps it knows that at least time
$c_1 \cdot d/c_1 = d$ (and at most $c_2 \cdot d/c_1 = Cd$) has elapsed. We will be particularly interested in how
this timing uncertainty factor of $c_2/c_1$, henceforth denoted $C$, affects the time complexity of
problems relative to their synchronous time complexity.
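
The step-counting argument above can be illustrated with a small sketch. The function names and the numeric values below are hypothetical and chosen only for illustration; the sketch assumes that a process waits for the smallest whole number of steps that guarantees time $d$ has elapsed.

```python
import math


def steps_to_wait(d: float, c1: float) -> int:
    """Smallest number of local steps that guarantees at least time d has elapsed."""
    return math.ceil(d / c1)


def elapsed_bounds(steps: int, c1: float, c2: float) -> tuple:
    """Real time spanned by `steps` consecutive step intervals, each in [c1, c2]."""
    return steps * c1, steps * c2


# Illustrative values only: d = 10, c1 = 1, c2 = 4, so C = c2 / c1 = 4.
d, c1, c2 = 10.0, 1.0, 4.0
k = steps_to_wait(d, c1)
earliest, latest = elapsed_bounds(k, c1, c2)
print(k, earliest, latest)
```

With these values the process counts 10 steps; the elapsed time is at least $d$ but may be as large as $Cd$, which is the source of the factor $C$ studied in this thesis.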

Of particular interest are problems that are intractable in an asynchronous setting yet 
have solutions with tight bounds in the synchronous setting. A simple example is the basic 
task of detecting the failure of stopped processes. Clearly, if there is no bound on message 
delay or relative process step time, then failures can never be detected with certainty; in a 
synchronous system, any stopping failure can be detected within approximately the maxi- 
mum message delay time. Another natural candidate is the consensus problem. It is well 
known that a completely asynchronous algorithm for consensus cannot tolerate the failure 
of even one process, whereas exactly $f + 1$ rounds of synchronous communication are needed
to tolerate $f$ failures in a synchronous system.

1.1 Reaching consensus — known time bounds 

The problem of reaching consensus in the presence of failures is one of the most well-studied 
problems in distributed computing. We consider the version of this problem for a system 
of $n$ deterministic processes, some $f$ of which may fail, completely connected by a reliable
message system. The processes begin executing at the same time, each with a private binary 
input, and must each decide on a binary value such that no two nonfaulty processes decide 
differently and if all processes begin with value v then v is the decision of all nonfaulty 
processes. In this thesis, we consider two kinds of process failure: send-omission failures, by 
which a process may unwittingly omit messages of an algorithm, and Byzantine failures, by 
which a process may exhibit arbitrary behavior. 

It is well known ([FLP85]) that in an asynchronous system, this problem cannot be solved 
deterministically even if the only failure to be tolerated is the unannounced halting (stopping)
of a single process. The work of [DDS87] methodically explores the synchrony necessary to
reach consensus; they show that if there is no upper bound on message delay or there is no
upper bound on the relative rate of process steps (that is, if any of our bounds $d$, $c_1$, or $c_2$ does not
hold), then there is no deterministic solution tolerating even a single stopping failure.

The time complexity of the consensus problem has been well studied in the synchronous 
rounds model (see, for example, [LSP82, PSL80, FL82, DS83, DLM82]). It is well known
that $f + 1$ rounds of communication are both sufficient ([PSL80]) and necessary ([FL82, M85,
DM86, CD86]) to reach consensus, regardless of the severity of failures (stopping, omission, 
or Byzantine). In [DLS88], the problem was studied using a model of partial synchrony in 
which upper bounds on message delivery time and/or processes' relative step rates exist, 
but they are unknown a priori to the processes. The algorithms of [DLS88] are concerned 
with fault tolerance rather than timing efficiency, and therefore translate to relatively slow 
algorithms for our model. 

For our semi-synchronous model, a lower bound of $(f + 1)d$ is implied by the synchronous
lower bound of $f + 1$ rounds, via a straightforward transformation of any algorithm for our
model to an algorithm for the synchronous model. For stopping and omission failures,
any synchronous round-based algorithm may be simulated directly, yielding an algorithm
for our model with a running time approximately $C$ times the synchronous running time.
This simulation strategy is described in Section 3.1. Thus, upper bounds of approximately
$(f + 1)Cd$ are easily derived. For Byzantine failures, it is not clear how to simulate a
synchronous algorithm correctly. 

In [ADLS90], Attiya, Dwork, Lynch, and Stockmeyer prove nearly tight upper and lower 
bounds on the time to reach consensus in the presence of stopping failures. Surprisingly, 
they give a clever algorithm for consensus that runs in time $2fd + Cd$, much faster than
a direct simulation when $C$ is large. They also show a lower bound of $(f - 1)d + Cd$ in a
proof that combines the arguments of the synchronous lower bound with techniques from 
asynchronous lower bounds and retiming techniques for our semi-synchronous model. 
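
To make the gap between these known bounds concrete, the following sketch evaluates them for one set of parameters. The function names and the sample values of $f$, $d$, and $C$ are hypothetical; the formulas are the approximate bounds quoted in this section, so the numbers should be read as rough comparisons only.

```python
def lower_bound(f: int, d: float, C: float) -> float:
    """Larger of the two lower bounds quoted above: (f+1)d and (f-1)d + Cd."""
    return max((f + 1) * d, (f - 1) * d + C * d)


def direct_simulation(f: int, d: float, C: float) -> float:
    """Approximate time of directly simulating f+1 synchronous rounds."""
    return (f + 1) * C * d


def adls_stopping(f: int, d: float, C: float) -> float:
    """Running time 2fd + Cd of the [ADLS90] algorithm for stopping failures."""
    return 2 * f * d + C * d


# Illustrative values only: f = 3 failures, d = 1 time unit, C = 10.
f, d, C = 3, 1.0, 10.0
print(lower_bound(f, d, C), adls_stopping(f, d, C), direct_simulation(f, d, C))
```

For these values the lower bound is 12, the algorithm of [ADLS90] takes 16, and a direct simulation takes 40, which illustrates why the dependence on $C$ matters when $C$ is large.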

1.2 Related work 

Current research also concentrating on the real time complexity of the consensus problem 
appears in [SDC90]. There, processes are assumed to have clocks that are synchronized to 
within a fixed additive error. In contrast to our results, the results of [SDC90] are stated 
in terms of process clock time, not absolute time. The relationship between those results 
and ours is unclear; a better understanding of the differences between the two models
is posed as a direction for further research in Section 6.2. 

A related model is studied in [HK89] to explore the time complexity of detecting failures 
along a network path. This model assumes synchronous processes but differentiates between 
the (known) a priori worst-case bound on message delay, $\Delta$, and the (unknown) actual
worst-case message delay in a given execution, $\delta$. Since $\delta$ may be much less than $\Delta$, it is desirable
for algorithms to have minimal dependency on $\Delta$. This model raises a concern similar to
that raised by our model: detecting the absence of a message may be much more costly than
receiving the message. Our algorithms run equally well in this model; we remark on how our
bounds translate to this model in Section 6.1.

Other work in this area includes the extensive literature on clock synchronization algo- 
rithms (see [SWL86] for a survey). Other problems recently studied in our model of timing 
uncertainty include the problem of mutual exclusion ([AL89]) and the complexity of a net- 
work synchronizer algorithm ([AM90]). 



1.3 Results of this thesis 

1.3.1 Consensus in the presence of omission failures 

In Chapter 3, we strengthen the algorithm of [ADLS90] to tolerate omission failures. The
resulting algorithm has a running time of $4(f + 1)d + Cd$ for $n \ge 2f + 1$. This is approximately
within a constant factor (4) of the lower bounds of $(f + 1)d$ and $(f - 1)d + Cd$ ([ADLS90])
and minimizes the dependence on the timing uncertainty $C$.

For $n \le 2f$, a more involved analysis bounds the running time by two different quantities
simultaneously: one bound is dependent on the ratio $\frac{f}{n-f}$ and the other is dependent on $\sqrt{C}$.
We first derive the bound $(3\frac{f}{n-f} + 5)(f + 1)d + Cd$ using a finer analysis that is similar in
spirit to the analysis for $n \ge 2f + 1$. We then show that $(2\sqrt{C} + 6)(f + 1)d + Cd$ is also a
bound on the running time using a simple but different argument.



1.3.2 Consensus in the presence of Byzantine failures 

In Chapter 4, we present a simulation algorithm using $3f + 1$ processes and tolerating $f$
arbitrary failures. The algorithm simulates any synchronous round-based algorithm tolerant
of $f$ arbitrary failures using roughly time $2Cd + d$ per round.

The simulation works by keeping processes loosely synchronized to ensure that a nonfaulty 
process does not advance to round $r$ until it has received a round $r - 1$ message from every
nonfaulty process. The partial synchronization works by using a combination of two criteria 
for advancing to further phases, one based on elapsed local time and the other based on 
messages received. 

It follows that any of the known synchronous consensus algorithms tolerating $f$ Byzantine
failures and taking $f + 1$ rounds can be run in our model in time $(f + 1)(2Cd + d)$.

1.3.3 Timeouts using bounded-capacity message links

In Chapter 5, we define a realistic restriction on the message links of our model and examine 
its effect on the time needed to detect stopping failures. According to the model of [AL89] 
and [ADLS90] (used in Chapter 3), every message sent by a process is delivered within time 
d of when it is sent, regardless of the rate at which messages are sent. In reality, if a link is 
flooded with messages, their delay may be much greater. Our algorithm for omission failures 
and the algorithm of [ADLS90] ignore this consideration by requiring a process to send a 
message at every step it takes. This enables failures to be detected as quickly as possible, 
but is grossly inefficient in its use of messages. We therefore define a more realistic model of 
message delay that takes into consideration the rate at which messages are sent. 

We give a clean, modular definition of a message link of arbitrary capacity $\mu$. Such a link
may be thought of as allowing the "progress" of only $\mu$ messages at any time. We then
derive nearly tight bounds on the time needed to detect a stopping failure using such links.
Two easy algorithms guarantee that the time between a failure and its detection is at most
$2Cd + d$ and $C^2 d/\mu + Cd + d$, respectively. We show that these bounds are nearly optimal
by proving a lower bound of the lesser of $2Cd + d/\mu$ and $C^2 d/\mu + Cd + d$.
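
As a rough numerical illustration of how close these bounds are, the sketch below evaluates the better of the two upper bounds and the lower bound for one choice of parameters. The function names and sample values are hypothetical, and the formulas are taken directly from the bounds stated above.

```python
def detection_upper_bound(C: float, d: float, mu: float) -> float:
    """Better of the two easy algorithms: min(2Cd + d, C^2 d / mu + Cd + d)."""
    return min(2 * C * d + d, C * C * d / mu + C * d + d)


def detection_lower_bound(C: float, d: float, mu: float) -> float:
    """Lower bound: min(2Cd + d / mu, C^2 d / mu + Cd + d)."""
    return min(2 * C * d + d / mu, C * C * d / mu + C * d + d)


# Illustrative values only: C = 4, d = 1, link capacity mu = 2.
C, d, mu = 4.0, 1.0, 2.0
print(detection_upper_bound(C, d, mu), detection_lower_bound(C, d, mu))
```

For these values the upper bound is 9 and the lower bound is 8.5; the two differ only in the term $d$ versus $d/\mu$.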






Chapter 2 



Model and Definitions 



Our underlying formal model is essentially the same as that used in [ADLS90]. Our model 
differs by assuming for ease of presentation that all messages are delivered in the order sent 
and that processes begin executing the algorithm at the same time. The former assumption 
is not used in our algorithm for Byzantine failures and is easily removed from our algorithm 
for omission failures by employing a more complicated protocol for receiving messages. The 
latter assumption is avoided in [ADLS90] by instead providing a special individual input 
event for each process, in which it receives its initial value for the consensus protocol. In 
measuring the time complexity of the algorithm, time is measured only from the earliest 
time that all processes have received an input. Using the same formalism, our algorithm for 
omission failures works equally well without the assumption of a synchronized start. This is
not true, however, for our algorithm tolerating Byzantine failures, where we make use of the 
fact that all nonfaulty processes begin executing the algorithm at the same time. Without 
this assumption, the problem is complicated by the need to determine when all processes 
have received inputs. Also, in addition to allowing stronger failures than [ADLS90], we 
assume that processes know the number of failures, $f$, to be tolerated.



2.1 Formal model 

We consider a system of n processes 1, . . . , n. Each process is a deterministic state machine 
with possibly an infinite number of states and a distinguished start state. 

A configuration is a vector $C$ consisting of the local states of each process. Let $st(i, C)$
denote the state of process $i$ in configuration $C$. We model a computation of the algorithm as
a sequence of configurations alternated with events. Each event $\pi$ is either the computation
step of a single process or the delivery of a message to a process. The local protocol of process
$i$ consists of two transition functions, $M_i$ for message delivery events, and $S_i$ for computation
events. Transition function $M_i$ is applied to a state of the process and a message (taken
from some finite message alphabet) and returns a state. (So, for example, a process can
"remember" a message that was delivered to it.) A message delivery event $\pi$ is of the form
$(m, i)$, specifying the message $m$ delivered and the recipient process, $i$. Transition function
$S_i$ is applied to a state of the process and returns a state and a finite set of messages to
be sent.$^1$ A computation step $\pi$ is of the form $(i, M)$, specifying the process $i$ taking the step
and the set of messages $M$ it sends in that step. ($M$ should be interpreted as the messages
the process actually sends at that step in the execution; if the process is faulty, this may not
correspond to those determined by the transition function.)

An execution is an infinite sequence of alternating configurations and events,
$\alpha = C_0, \pi_1, C_1, \ldots, \pi_j, C_j, \ldots$, where $C_0$ is the vector of start states and each configuration $C_j$ follows
from the previous configuration $C_{j-1}$ and the intervening event $\pi_j$, according to the state
transitions of the process at which event $\pi_j$ occurs. This means that if event $\pi_j$ is an event
at process $i$ then (1) for $k \ne i$, $st(k, C_{j-1}) = st(k, C_j)$, (2) if $\pi_j$ is a message delivery event
specifying the delivery of message $m$ then $st(i, C_j)$ is the result of applying $M_i$ to $st(i, C_{j-1})$ and
$m$, and (3) if $\pi_j$ is a computation event, then $st(i, C_j)$ is the result of applying $S_i$ to $st(i, C_{j-1})$.
Also, each message sent is delivered after it is sent and no unsent "messages" are delivered.

A timed event is a pair $(\pi, t)$, where $\pi$ is an event and $t$, the "time", is a nonnegative
real number. A timed sequence is an infinite sequence of alternating configurations and
timed events $\alpha = C_0, (\pi_1, t_1), C_1, \ldots, (\pi_j, t_j), C_j, \ldots$, where the times are nondecreasing and
unbounded.

Fix real numbers $c_1$, $c_2$, and $d$, where $0 < c_1 \le c_2 < \infty$ and $0 < d < \infty$. Letting $\alpha$ be a
timed sequence as above, we say that $\alpha$ is a timed execution if

1. $C_0, \pi_1, C_1, \ldots, \pi_j, C_j, \ldots$ is an execution;

2. The first step of each process is at time 0;

3. There are infinitely many computation steps for each process;

4. If $\pi_i$ and $\pi_j$ are consecutive computation steps of the same process, then $c_1 \le t_j - t_i \le c_2$; and

5. If message $m$ is sent to process $i$ during computation event $\pi_j$ then it is delivered to
process $i$ during message delivery event $\pi_k$, $j < k$, such that $0 \le t_k - t_j \le d$.

In our timing analysis (but not in our algorithms or correctness proofs), we make the
assumption that $c_2 \ll d$ and therefore make the approximation $d + c_2 \approx d$.
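
The timing conditions of a timed execution can be checked mechanically on a finite prefix. The sketch below is a hypothetical checker, not part of the formal model: the class and function names are invented for illustration, it assumes the bounds in conditions 4 and 5 are inclusive, and it cannot check condition 3, which concerns infinite executions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    process: int
    time: float


@dataclass(frozen=True)
class Delivery:
    recipient: int
    sent_at: float  # time of the computation step that sent the message
    time: float     # time of the delivery event


def check_timed_prefix(events, c1, c2, d):
    """Check nondecreasing times and conditions 2, 4, and 5 on a finite prefix."""
    last_step = {}  # process -> time of its most recent computation step
    previous_time = 0.0
    for event in events:
        if event.time < previous_time:
            return False  # times must be nondecreasing
        previous_time = event.time
        if isinstance(event, Step):
            if event.process not in last_step:
                if event.time != 0:
                    return False  # condition 2: first step at time 0
            else:
                gap = event.time - last_step[event.process]
                if not (c1 <= gap <= c2):
                    return False  # condition 4: step gap in [c1, c2]
            last_step[event.process] = event.time
        else:
            if not (0 <= event.time - event.sent_at <= d):
                return False  # condition 5: delivery within time d
    return True


good = [Step(1, 0.0), Step(2, 0.0), Step(1, 1.0), Delivery(2, 0.0, 1.5), Step(2, 2.0)]
bad = [Step(1, 0.0), Step(1, 3.0)]  # gap of 3 exceeds c2 = 2
print(check_timed_prefix(good, 1.0, 2.0, 3.0), check_timed_prefix(bad, 1.0, 2.0, 3.0))
```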



$^1$In all our algorithms, a process always sends the same message (at most one per step) to all processes,
including itself.

2.1.1 Omission failures

A process $i$ suffers an omission failure in execution $\alpha$ if and only if there is a computation
step $\pi_j$ of process $i$ in $\alpha$ specifying a set of messages that is a strict subset of the messages
determined by the transition function $S_i$ applied to $st(i, C_{j-1})$. Recall that computation
step $\pi_j$ specifies the messages actually sent by $i$ during that step of execution $\alpha$. Note that
according to our definition of an execution, $st(i, C_j)$ must be the result of applying $S_i$ to
$st(i, C_{j-1})$, regardless of the messages specified by $\pi_j$. This implies that the process itself is
"unaware" of its failure and, unless informed about it, continues executing as if it had not
failed. (This kind of failure is sometimes called a send-omission failure.) If the algorithm
requires $j$ to broadcast a message to all processes, but $j$ does not send a message to $i$, then
we say that "$j$ omits to $i$" or that this broadcast is "unsuccessful".

2.1.2 Byzantine failures 

A process suffers a Byzantine failure if it changes its state or sends messages in a way not 
specified by the transition functions of the algorithm. No restrictions are made on its state 
transitions or what messages it sends, and so it may exhibit arbitrary behavior. Furthermore, 
the time between successive steps of a faulty process might not be in the interval $[c_1, c_2]$.
The messages it sends, however, are delivered within time d of when they are sent. 

2.1.3 Consensus 

Finally, we define the consensus problem. We assume that each process begins with an 
initial binary value (its "input") as part of its local state and may irreversibly "decide" on 
a value by entering a specially designated state. The problem is for the processes to agree 
on a binary value despite the failure of some processes. We say that a timed execution $\alpha$ is
$f$-admissible if at most $f$ processes fail in $\alpha$. An algorithm solves the consensus problem for
$f$ failures within time $T$ provided that for each of its $f$-admissible timed executions $\alpha$, (1) no
two different processes decide on different values (agreement), (2) if some nonfaulty process
decides on $v$, then some process has initial value $v$ (validity), and (3) every nonfaulty process
decides by time $T$ (time bound). Note that the validity condition does not imply termination;
termination is implied by the third condition. We consider the binary version of the problem,
where the initial values are 0 or 1. Like the algorithm of [ADLS90], our algorithm for omission
failures can be extended to work for any value set, using the same extension given there
([ADLS90], Section 5.4). Our algorithm for Byzantine failures is a general simulation for
any round-based algorithm and therefore can simulate any synchronous agreement algorithm
for any value set.
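
The three conditions of the definition can be phrased as a check on the outcome of a single execution. The sketch below is illustrative only: the function name and the dictionary-based representation of inputs and decisions are assumptions made here, not part of the formal model.

```python
def satisfies_consensus(inputs, decisions, faulty, T):
    """Check agreement, validity, and the time bound for one execution.

    inputs:    {process: initial value}
    decisions: {process: (decided value, decision time)} for processes that decided
    faulty:    set of processes that failed in the execution
    """
    decided_values = {value for value, _ in decisions.values()}
    agreement = len(decided_values) <= 1
    validity = all(
        value in inputs.values()
        for process, (value, _) in decisions.items()
        if process not in faulty
    )
    time_bound = all(
        process in decisions and decisions[process][1] <= T
        for process in inputs
        if process not in faulty
    )
    return agreement and validity and time_bound


inputs = {1: 0, 2: 1, 3: 1}
ok = satisfies_consensus(inputs, {1: (1, 5.0), 2: (1, 6.0), 3: (1, 6.0)}, set(), 10.0)
split = satisfies_consensus(inputs, {1: (0, 5.0), 2: (1, 6.0), 3: (1, 6.0)}, set(), 10.0)
late = satisfies_consensus(inputs, {1: (1, 5.0), 2: (1, 6.0), 3: (1, 12.0)}, set(), 10.0)
print(ok, split, late)
```

The second call fails agreement and the third fails the time bound, while the first satisfies all three conditions.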
Chapter 3

Consensus in the Presence of 
Omission Failures 



In this chapter, we present a consensus algorithm tolerant of send-omission failures. The 
algorithm uses the same strategy as that of [ADLS90]; we first elucidate this strategy by
describing a synchronous consensus algorithm upon which it is based and explaining our
algorithm in terms of that synchronous algorithm. For $n \ge 2f + 1$, the running time of our
algorithm is $4(f + 1)d + Cd$, which is approximately within a factor of 4 of the lower bounds
of $(f - 1)d + Cd$ and $(f + 1)d$ ([ADLS90]). For $n \le 2f$, the running time is bounded by two
quantities, $(3\frac{f}{n-f} + 5)(f + 1)d + Cd$ and $(2\sqrt{C} + 6)(f + 1)d + Cd$.

In order to motivate the work presented here, we first discuss bounds attainable by more
straightforward algorithms. 



3.1 Straightforward upper bounds 

Attiya, Dwork, Lynch, and Stockmeyer ([ADLS90]) give two simple algorithms tolerant of 
stopping failures and with running times of roughly fCd. One algorithm is based on a 
method for simulating any synchronous round-based algorithm; the other is specific to the 
consensus problem and requires that the processes begin synchronized. Both algorithms can 
be modified to tolerate omission failures without seriously affecting the running times. We 
briefly explain these two simple algorithms with the modifications. 

The first simple algorithm simulates any synchronous round-based algorithm and takes at 
most time Cd + d per round. The algorithm works by executing the round-based algorithm 
in parallel with a timeout task. The timeout task is similar to the one described at the 
beginning of Chapter 5: each process keeps a count of the number of steps it has taken and
at each step broadcasts the number of its current step to all other processes in the form
"I'm alive: s" at step number s. Each process also keeps track of the "I'm alive" messages 
received from other processes and detects failures in the expected way, by detecting gaps in 
the step numbering or by the absence of messages. (We will in fact employ this strategy in 
our algorithm.) While performing the timeout task, a process simulates each round of the 
synchronous algorithm by asynchronously executing it — a process simply waits indefinitely 
on every other process for either a message of that round or the detection of that process's 
failure. It is not hard to see that this accurately simulates the round-based algorithm: no 
process sends a round $r$ message before receiving a round $r - 1$ message from all nonfaulty
processes. A simple inductive argument shows that by time $r(Cd + d)$ (more accurately,
time $C(d + c_2) + (d + c_2)$ per round), every process has finished simulating round $r$ of the synchronous
algorithm. Thus, any synchronous consensus algorithm tolerant of omission failures taking
$f + 1$ rounds may be directly simulated to yield an algorithm for our semi-synchronous model
that takes time $(f + 1)(Cd + d)$.

Under the assumption that processes begin executing the algorithm at the same time, a 
simpler algorithm specific to the consensus problem may be used. This simpler algorithm 
does not make use of any fault-detection mechanisms. If a process starts with initial value 
1, it broadcasts a 1 and decides 1 and halts. If a process ever receives a 1 (and has not yet 
halted), it does the same. It is easy to see that if a correct process receives a 1, then some
correct process receives a 1 by time $fd$ and subsequently all correct processes receive a 1
by time $(f + 1)d$ (more accurately, $(f + 1)(d + c_2)$). Therefore a process may decide 0 if it
has run for more than $(f + 1)(d + c_2)/c_1$ steps without deciding. This takes at most time
approximately $(f + 1)Cd$.
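
The step threshold used by this simpler algorithm, and the resulting worst-case time, can be computed as in the following sketch. The function names and numeric values are hypothetical; the sketch assumes the process decides at the first step whose count strictly exceeds $(f + 1)(d + c_2)/c_1$.

```python
import math


def decide_after_steps(f: int, d: float, c1: float, c2: float) -> int:
    """First step count strictly greater than (f + 1)(d + c2) / c1."""
    return math.floor((f + 1) * (d + c2) / c1) + 1


def worst_case_time(steps: int, c2: float) -> float:
    """Largest real time that `steps` local steps can take."""
    return steps * c2


# Illustrative values only: f = 2, d = 100, c1 = 1, c2 = 2 (so C = 2 and c2 << d).
f, d, c1, c2 = 2, 100.0, 1.0, 2.0
steps = decide_after_steps(f, d, c1, c2)
print(steps, worst_case_time(steps, c2), (f + 1) * (c2 / c1) * d)
```

For these values the process waits 307 steps, which take at most time 614, close to the approximate bound $(f + 1)Cd = 600$ obtained under the assumption $c_2 \ll d$.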

Finally, we remark that the efficient algorithm of [ADLS90] can be modified to tolerate 
omission failures by using the timeout task for omission failures outlined above. The running 
time, however, is then roughly $f^2 d + Cd$. This bound follows from a modification of the part
of the analysis of [ADLS90] which takes the sum over each phase r of the number of processes 
that fail during the sending of an r message. Because only stopping failures are considered 
in [ADLS90], the analysis there concludes that a process may fail during the sending of at 
most one $r$ message and therefore the sum over all $r$ is at most $f$. If failures are by omission,
then a process may fail during the sending of many $r$ messages, but only once for any $r$.
Because there are at most $f + 2$ phases in any $f$-admissible execution, the sum over all $r$ is
at most $(f + 1)f$, resulting in a bound of approximately $(f + 1)fd + Cd$.

3.2 Intuition: the underlying synchronous algorithm

Our algorithm and the algorithm of [ADLS90] may be interpreted as simulations of an 
underlying synchronous algorithm. In this underlying synchronous algorithm, all processes 
begin executing in round 0. In even numbered rounds, processes may decide only on 0; in 
odd numbered rounds, processes may decide only on 1. In round 0, any process with initial
value 0 decides immediately and broadcasts a message saying "I decided in round 0"; any
process with initial value 1 broadcasts a message saying "I didn't decide in round 0" and
advances to round 1. In any subsequent round $r$, if a process did not receive a message
saying "I decided in round $r - 1$", it may decide $r \bmod 2$, broadcasting "I decided in round
$r$"; if it did receive a message saying "I decided in round $r - 1$", it advances to round $r + 1$,
broadcasting "I didn't decide in round $r$".

It is easy to see that if a nonfaulty process decides in round $r$ then no process decides
in round $r + 1$ and all processes then decide in round $r + 2$. The algorithm is also
"early-stopping": any execution in which at most $f$ processes fail takes at most $f + 2$ rounds of
communication. (This means that all processes decide in round $f + 2$ or earlier, despite the
fact that the first round is numbered 0, since a decision in round $i$ is based on messages sent
in round $i - 1$ or earlier.) This is easily seen by observing that if an execution takes $x$ rounds
then a faulty process decides in each of rounds 0 through $x - 3$: if no faulty process decides
in round $i \le x - 3$ then either (1) a nonfaulty process decides in round $i$ and all processes
decide by round $i + 2$, or (2) no process decides in round $i$ and therefore they all decide in
round $i + 1$ (because no process receives an "I decided in round $i$" message). Thus, $f$ failures
cause the maximum number of rounds, $f + 2$, in the following execution. All processes
except some process $j_0$ begin with initial value 1 and advance to round 1. Process $j_0$, with
initial value 0, broadcasts "I decided in round 0" to all processes except some other process
$j_1$. Thus all processes except $j_1$ advance to round 2; $j_1$ decides in round 1 and broadcasts
"I decided in round 1" to all processes except some process $j_2$. This continues until finally
process $j_{f-1}$ decides in round $f - 1$ and broadcasts "I decided in round $f - 1$" to all processes
except nonfaulty process $j_f$, which decides in round $f$; all processes subsequently decide
in round $f + 2$.
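
The round structure of the underlying synchronous algorithm can be explored with a small simulation. This is a minimal sketch under stated assumptions, not the algorithm of Section 3.3: the function name and the `omits` schedule are hypothetical, every process decides as soon as it is permitted to, and only "I decided in round $r$" messages are modeled, since the absence of such a message is what triggers a decision.

```python
def run_synchronous(inputs, omits, max_rounds=100):
    """Return {process: round in which it decided}; the decided value is round % 2.

    inputs: list of initial binary values, indexed by process.
    omits:  {(sender, round): set of recipients that do NOT receive the sender's
            "I decided in round r" message} (send-omission failures).
    """
    n = len(inputs)
    decided_round = {}
    heard_decided_prev = [False] * n  # did p receive "I decided in round r-1"?
    r = 0
    while len(decided_round) < n and r <= max_rounds:
        deciders = []
        for p in range(n):
            if p in decided_round:
                continue
            # Round 0: decide only with initial value 0. Later rounds: decide
            # only if no "I decided in round r-1" message was received.
            decides = (inputs[p] == 0) if r == 0 else not heard_decided_prev[p]
            if decides:
                decided_round[p] = r
                deciders.append(p)
        heard_decided_prev = [False] * n
        for sender in deciders:
            for q in range(n):
                if q not in omits.get((sender, r), set()):
                    heard_decided_prev[q] = True
        r += 1
    return decided_round


# Worst-case pattern with f = 2: process 0 omits to 1, then process 1 omits to 2.
f = 2
rounds = run_synchronous([0, 1, 1, 1], {(0, 0): {1}, (1, 1): {2}})
print(rounds, max(rounds.values()))
```

In this run the two faulty processes decide in rounds 0 and 1, and the last nonfaulty process decides in round $f + 2 = 4$, matching the early-stopping bound described above.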

Both our algorithm and that of [ADLS90] "simulate" this synchronous algorithm, making 
several important optimizations in order to improve the running time for our model. If during 
the simulation of round $r$, a process receives a message saying "I decided in round $r - 1$",
it immediately advances to round $r + 1$ (without waiting for round $r - 1$ messages from
other processes), broadcasting to all processes, in effect, "I know of a process that decided in
round $r - 1$". Other processes in round $r$ that receive this message relay it to all processes
and also advance immediately to round $r + 1$. A process may decide in round $r$ only if it
can be sure that no nonfaulty process decided in round $r - 1$. This is ascertained only when,
for every other process $p$, either (1) the message "I didn't decide in round $r - 1$" is received
from $p$, or (2) $p$ has been detected as faulty (by the timeout protocol), or (3) for some
$r' < r - 1$, the message "I decided in round $r'$" has been received from $p$ (also remembered
by the timeout protocol).

The key to the improved efficiency of our algorithm relative to that of [ADLS90] is the 
addition of a mechanism for a process to detect its own failure. We require that a process 
receive at least $n - f$ acknowledgments for every message of the synchronous algorithm that
it sends. Until a process has received a sufficient number of acknowledgments for its round 
r message, it is prohibited from deciding in round r + 1 or advancing to round r + 2. This 
is important to the efficiency of the algorithm because it limits to 1 the number of times a 
faulty process can omit a message of the synchronous algorithm to all nonfaulty processes. 
For $n \ge 2f + 1$, the convention of waiting for acknowledgments ensures that a faulty process
does not advance to round r + 1 if it omits to all nonfaulty processes a message saying "I 
know of a process that decided in phase r". If it does send such a message to a nonfaulty 
process, that nonfaulty process in turn relays it to all other processes; the faulty process 
therefore has not delayed the algorithm by very much (time d at most). The convention 
of waiting for acknowledgments requires that a process continue executing the algorithm, 
sending acknowledgments, after it has decided. 

3.3 The algorithm 

We first explain the presentation of our algorithm. We describe our algorithm as the parallel
composition of a fault-detection protocol and a main algorithm. At each step, a process
first executes the code of the fault-detection protocol, then executes the code of the main
algorithm, and finally sends a message. (Recall that in our model a process may send at
most one message at each step.)

This message is the concatenation of possibly several component "messages" which are
specified by the queue commands in the code: if during a step, the statement "queue 'm'"
is executed in the code, then "message" m is a component of the message sent at the end
of that step. We will refer to a message by any one of its components: we will say "an m
message" or simply "an m" to refer to any message with m as one of its components.

Our model also specifies that a process receives messages only during delivery events (and 
therefore only between process steps). For every delivery event, a process changes its state 
by adding the received message to a buffer (an unordered set). At its next step, the process 
reads and empties this buffer. A conditional statement in the code referring to the receipt 
of a message checks whether such a message was read from this buffer during the given step. 
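
The step and delivery conventions above can be illustrated with a minimal Python sketch. The class name `Process`, its methods, and the tuple encoding of components are assumptions made for this illustration only; they are not part of the model's definition.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy model of one process; all names here are illustrative assumptions."""
    pid: int
    buffer: set = field(default_factory=set)       # unordered set filled by delivery events
    outgoing: list = field(default_factory=list)   # components queued during the current step

    def deliver(self, sender, component):
        # Delivery event: occurs between steps and only adds to the buffer.
        self.buffer.add((sender, component))

    def queue(self, component):
        # "queue 'm'": m becomes one component of the single message sent this step.
        self.outgoing.append(component)

    def step(self, handler):
        # Read and empty the buffer, run the protocol code, then send one message.
        received, self.buffer = self.buffer, set()
        self.outgoing = []
        handler(self, received)
        return tuple(self.outgoing)   # the concatenated message sent at the end of the step
```

A conditional such as "if an m message is received" corresponds to testing membership in the `received` set that is passed to the handler for that step.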

For ease of presentation, some components of a process's state are not explicitly named
or maintained in the code: for instance, the number of steps a process has taken, whether it
has decided, or whether it has sent a certain message. Process index subscripts are omitted
in the code but used in the text (e.g., "$D_i$") to refer to a local variable ($D$) of process $i$.

3.3.1 The fault-detection protocol 

In order to tolerate omission failures, our algorithm employs the timeout protocol described
in Section 3.1. A process sends a message at every step that it takes, consecutively numbering
all messages that it sends with the number $s$ of its current step.¹ Before a process decides,
the message that it sends at every step is of the form "I'm alive: $s$", where $s$ is the number
of its current step; after a process decides, the message is of the form "I've decided: $s$".
The failure of a process can thus be detected by a gap in the sequence numbering (recall
we assume that message links deliver messages in the order sent) or by the absence of any
messages for too long a period of time (more than time $d + c_2$).

All processes detected as faulty are added to a local set $F$. When a process $i$ detects the
failure of another process $j$, it broadcasts this fact in the form of a "shutdown $j$" message.
Upon receiving this message, other processes add $j$ to their respective sets $F$; when process
$j$ receives this message, it halts, ceasing its execution of the algorithm. The timeout protocol
also keeps track of which processes have decided. When a process receives a message "I've
decided: $s$" from another process, it adds that process to its set $D$. When a process $i$ adds
$j$ to $D_i$ ($F_i$, resp.), it is said to have "detected" that $j$ has decided (failed, resp.). We say
that a process $i$ is shut down at time $t$ if it receives a "shutdown $i$" message at time $t$. The
code for the fault-detection protocol is in Figure 3.1.

We now verify two basic properties of the fault-detection protocol with respect to arbitrary
executions. The first bounds the time by which a failure is detected.

Lemma 3.1 If at time $t$, process $j$ omits a message to process $i$, and $i$ is not shut down by
time $t + C(d + c_2) + (d + c_2) \approx t + Cd + d$, then $i$ adds $j$ to $F_i$ by that time.

Proof: Let $s_j$ be the step number of $j$ at which it omits a message to $i$. The lemma is
clearly true if $j$ sends a message to $i$ at a step numbered greater than $s_j$ and that message
arrives at $i$ by time $t + C(d + c_2) + (d + c_2)$. If $j$ does not send such a message, then $i$ receives
no message from $j$ between time $t + d$ and $t + d + c_2(1 + (d + c_2)/c_1) = t + (d + c_2) + C(d + c_2)$,
in which time $i$ takes more than $(d + c_2)/c_1$ steps and, since it is not yet shut down, adds $j$
to $F_i$. ■
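
The arithmetic in this proof can be checked mechanically. The following Python sketch (function names are hypothetical, chosen for this illustration) evaluates both sides of the identity $t + d + c_2(1 + (d + c_2)/c_1) = t + (d + c_2) + C(d + c_2)$ with exact rational arithmetic, using $C = c_2/c_1$.

```python
from fractions import Fraction


def timeout_steps(d, c1, c2):
    """Step-count threshold (d + c2)/c1 used by the timeout test."""
    return Fraction(d + c2) / Fraction(c1)


def detection_deadline(t, d, c1, c2):
    """Deadline of Lemma 3.1: t + C(d + c2) + (d + c2), with C = c2/c1."""
    C = Fraction(c2) / Fraction(c1)
    return t + C * (d + c2) + (d + c2)


def silence_window_end(t, d, c1, c2):
    """End of the silent interval in the proof: t + d + c2 * (1 + (d + c2)/c1)."""
    return t + d + c2 * (1 + timeout_steps(d, c1, c2))
```

For any positive parameters the two expressions coincide, which is the step the proof relies on.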



¹ As a consequence of the bound on running time to be derived, these sequence numbers are bounded by
a function of $f$, $d$, $c_1$ and $c_2$.

STEP s: If "shutdown i" received, then halt.
        If decided, then queue "I've decided: s"
                    else queue "I'm alive: s".
        For each j ∉ D ∪ F,
            if "shutdown j" message received
                then F ← F ∪ {j}; queue "shutdown j"
            if "I've decided: s_j" message received from j
                then D ← D ∪ {j}
            if "I'm alive" messages from j not numbered consecutively
                then F ← F ∪ {j}; queue "shutdown j"
            if no message received from j and more than
                    (d + c_2)/c_1 steps taken since last message received from j,
                then F ← F ∪ {j}; queue "shutdown j".

Figure 3.1: The fault-detection protocol for i at step number s.
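
As an informal companion to Figure 3.1, the following Python sketch runs one step of the fault-detection protocol. It is only a sketch under stated assumptions: the class `FaultDetector`, the encoding of received messages as `(sender, kind, arg)` tuples, the numbering of steps from 1, and the per-peer counter of silent steps are all choices made for this illustration.

```python
class FaultDetector:
    """Sketch of Figure 3.1 for process `me`; the message encoding is assumed."""

    def __init__(self, me, peers, d, c1, c2):
        self.me = me
        self.peers = [j for j in peers if j != me]
        self.limit = (d + c2) / c1                   # allowed number of silent steps
        self.D, self.F = set(), set()                # detected-decided / detected-faulty
        self.decided = False
        self.halted = False
        self.s = 0                                   # own step number (assumed to start at 1)
        self.last_seq = {j: 0 for j in self.peers}   # last sequence number seen from j
        self.silent = {j: 0 for j in self.peers}     # steps since last message from j

    def step(self, received):
        """received: set of (sender, kind, arg) tuples read from the buffer."""
        self.s += 1
        if any(kind == "shutdown" and arg == self.me for _, kind, arg in received):
            self.halted = True                       # "shutdown i" received: halt
            return []
        out = [("decided" if self.decided else "alive", self.s)]
        for j in self.peers:
            if j in self.D or j in self.F:
                continue
            faulty = any(kind == "shutdown" and arg == j for _, kind, arg in received)
            numbered = sorted((arg, kind) for snd, kind, arg in received
                              if snd == j and kind in ("alive", "decided"))
            for seq, kind in numbered:
                if seq != self.last_seq[j] + 1:
                    faulty = True                    # gap in the sequence numbering
                self.last_seq[j] = seq
                if kind == "decided":
                    self.D.add(j)
            self.silent[j] = 0 if numbered else self.silent[j] + 1
            if self.silent[j] > self.limit:
                faulty = True                        # silent for more than (d + c2)/c1 steps
            if faulty:
                self.F.add(j)
                out.append(("shutdown", j))
        return out
```

The returned list plays the role of the components queued during the step; the main algorithm would append its own components before the single message is sent.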



The second property verifies that nonfaulty processes are never declared faulty. 

Lemma 3.2 If process i does not fail in an execution, then i is not added to any set Fj and 
is never shut down. 

Proof: For contradiction, let $j$ be the process that first adds $i$ to its failed set $F_j$. Process $j$
adds $i$ to $F_j$ because either it receives a "shutdown $i$" message, or it receives two "I'm alive"
messages from $i$ with a gap in sequence numbering, or it does not receive an "I'm alive"
message from $i$ for more than $(d + c_2)/c_1$ steps.

By our choice of $j$, process $j$ cannot receive a "shutdown $i$" message before adding $i$ to
$F_j$: that would imply that some other process added $i$ to its failed set before $j$ did.

Because $i$ is nonfaulty (and the links are FIFO), $j$ does not receive two "I'm alive"
messages with a gap in the sequence numbering.

Before it decides, $i$ sends "I'm alive" messages at every step it takes and so any two
consecutive messages are delivered to $j$ at most time $d + c_2$ apart (if one message is delivered
immediately and the following message is delayed by $d$). In time $d + c_2$, $j$ can take at most
$(d + c_2)/c_1$ steps and therefore does not add $i$ to $F_j$. After $i$ decides, it broadcasts an
"I've decided" message, which causes $j$ to add $i$ to $D_j$ and prevents $j$ from adding $i$ to $F_j$
thereafter. Thus, $j$ cannot add $i$ to $F_j$. ■






3.3.2 The main algorithm 

The main algorithm is basically an asynchronous version of the synchronous algorithm of
Section 3.2. The code for the main algorithm appears in Figure 3.2. We call the simulation
of round $r$ of the synchronous algorithm "phase $r$". Each process $i$ starts in phase 0 with $v_i$
set to its own private value (1 or 0). In its first step, a process either decides or advances to
phase 1. As with the synchronous algorithm, in even-numbered phases a process can decide
only 0, and in odd-numbered phases a process can decide only 1.

When a process advances from phase $r$ to phase $r + 1$, it broadcasts an "$r$" message. (This
is the equivalent of the message "I didn't decide in round $r$" in the synchronous algorithm.)
When a process decides in phase $r$, it broadcasts an "$r + 1$" message. (The message "$r + 1$"
replaces the messages "I decided in round $r$" and "I didn't decide in round $r + 1$" of the
synchronous algorithm; both have the meaning "I know of a process that decided in round
$r$", so it is unnecessary to distinguish between them.) Set $M^r$ contains those processes from
which an $r$ message has been received. A process may decide in phase $r$ only if it has (1) not
yet received an $r$ message, and therefore does not know of a process that decided in round
$r - 1$, and (2) has received an $r - 1$ message from all processes not yet detected as faulty or
decided, indicating that they did not decide in round $r - 1$. If process $i$ is nonfaulty, then
the receipt of an $r + 1$ message from $i$ prevents other processes from deciding in phase $r + 1$
since they do not add $i$ to $D$ or $F$ before receiving it. A process that decides in round $r$ does
not send an $r$ message unless it receives one first (this implies that some process decided in
round $r - 1$ but failed).

Our convention of acknowledging messages works as follows. Each process maintains a
set $A^r$ containing those processes from which a properly sequenced "ack($\cdot$, $r$)" message has
been received. (The restriction to properly sequenced "ack($\cdot$, $r$)" messages is achieved by
not adding a process to $A^r$ if that process is already in $F$. This restriction is necessary only
for the bound when $n \le 2f$.) Until a process decides, it sends exactly one acknowledgment
message, "ack($j$, $r'$)", for each $r'$ message received from a process $j$, where $r'$ is less than its
current phase number. After a process decides in some phase $r$, it continues to acknowledge
$r'$ messages for $r' \le r + 1$. This is implemented in the code by allowing the process to advance
to phase $r + 2$ but no further. It is not necessary for a process to acknowledge $r'$ messages for
$r' > r + 1$ because, as we will see, if it is nonfaulty then other nonfaulty processes do not
advance to phase $r + 3$ without deciding and therefore do not require acknowledgments for
their $r + 2$ messages. Until a process has received at least $n - f$ properly sequenced
acknowledgments for its $r - 1$ message ($|A^{r-1}| \ge n - f$), it may not advance to phase $r + 1$
or decide in phase $r$.

Definition 1 A process $i$ is blocked in phase $r$ (for $r > 0$) if it advances to phase $r$ without
deciding and never has $|A_i^{r-1}| \ge n - f$.

Being blocked is a permanent state, but even if a process is not blocked in phase $r$, it may
be temporarily delayed from advancing to phase $r + 1$ as it waits for acknowledgments before
proceeding.

PHASE 0: If v = 1, then queue "0" and goto PHASE 1.
         If v = 0, then queue "1" and decide 0 and goto PHASE 2.

PHASE r > 0: For each j and each r', 1 ≤ j ≤ n and 0 ≤ r' ≤ r,
                 if "r'" message received from j,
                     then M^{r'} ← M^{r'} ∪ {j}
                 if "ack(i, r - 1)" received from j and j ∉ F,
                     then A^{r-1} ← A^{r-1} ∪ {j}
                 if j ∈ M^{r'} and r' < r and "ack(j, r')" not yet sent,
                     then queue "ack(j, r')".                (whether decided or not)

             If decided and M^{r-2} ≠ ∅ and "r - 2" not yet sent,
                 then queue "r - 2"
             If not decided and |A^{r-1}| ≥ n - f,            (enough ack's received)
                 then if M^r ≠ ∅,                            (some process decided in phase r - 1)
                          then queue "r" and goto PHASE r + 1
                      if M^r = ∅ and j ∈ M^{r-1} for all j ∉ (D ∪ F),
                          then queue "r + 1" and decide r mod 2 and goto PHASE r + 2

Figure 3.2: The main algorithm of process i, performed at every step. Initially,
a process is in phase 0 with M^r = A^r = ∅ for all r.
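
The last rule of Figure 3.2, which determines whether an undecided process waits, advances, or decides, can be sketched as a pure Python function. The function name, the dictionary representation of the sets $M^r$ and $A^r$, and the parameter `others` (the set of all other process identifiers) are assumptions of this illustration rather than part of the algorithm's statement.

```python
def phase_action(r, decided, M, A, D, F, others, n, f):
    """Action permitted by the final rule of Figure 3.2 in phase r >= 1.

    M and A map a phase number to a set of process ids (M^r and A^r);
    `others` is the set of all other process ids (an assumption of this sketch).
    """
    if decided or len(A.get(r - 1, set())) < n - f:
        return ("wait",)                   # already decided, or too few acks for the r-1 message
    if M.get(r, set()):
        return ("advance", r + 1)          # queue "r": some process decided in phase r-1
    undetected = set(others) - D - F
    if undetected <= M.get(r - 1, set()):
        return ("decide", r % 2, r + 2)    # queue "r+1", decide r mod 2, goto phase r+2
    return ("wait",)
```

Keeping the rule free of side effects makes the parity convention visible: a decision in an even phase can only be 0 and a decision in an odd phase can only be 1.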



We prove here a few basic lemmas about the main algorithm with respect to any
$f$-admissible execution. The first two lemmas affirm two expected properties that held for the
synchronous algorithm.

Lemma 3.3 If some nonfaulty process decides in phase $r \ge 0$, then no process decides in
phase $r + 1$.

Proof: Let $i$ be a nonfaulty process that decides in phase $r$ and consider any other process $j$.
According to the code of the main algorithm, $j$ cannot decide in phase $r + 1$ without receiving
an $r$ message from $i$ or adding $i$ to $F_j$ or $D_j$. We claim that none of these can happen before $j$
receives an $r + 1$ message from $i$, which according to the code (which requires $M^{r+1} = \emptyset$)
precludes $j$ from deciding in phase $r + 1$. First, because $i$ is nonfaulty, by Lemma 3.2 it is never
added to $F_j$. Process $i$ may send an $r$ message, but only after sending an $r + 1$ message.
Because $i$ is nonfaulty, it does not omit the $r + 1$ message, and because messages are delivered in
the order sent, $j$ does not receive the $r$ message before receiving the $r + 1$ message. Process $i$ is
added to $D_j$ only when $j$ receives the message "I've decided" from $i$. By the same argument, $j$
does not receive "I've decided" from $i$ before receiving the $r + 1$ message. ■

The following definition is useful in proving correctness and analyzing time complexity. 
Definition 2 Phase r is quiet if there is some process that never receives any r messages. 

Lemma 3.4 If a nonfaulty process decides in phase $r \ge 0$, then phase $r + 2$ is quiet.

Proof: By Lemma 3.3, no process decides in phase $r + 1$. If a process does not decide in
phase $r + 1$, then it does not send an $r + 2$ message until it receives one. Therefore, no
process sends an $r + 2$ message and in fact no process receives an $r + 2$ message. ■

The next two lemmas affirm that the convention of acknowledging r messages works as 
expected — nonfaulty processes are never blocked — and the last lemma states that the failure 
of blocked processes is eventually detected by all processes. 

Lemma 3.5 For any process $i$ and any nonfaulty process $j$, if $i$ advances to phase $r \ge 1$
without deciding and sends an $r'$ message to $j$ for $0 \le r' \le r - 1$, then $i$ receives an
"ack($i$, $r - 1$)" message from $j$.

Proof: By induction on $r$. Clearly the lemma is true for $r = 1$: $j$ advances to phase 1
during its first step and sends "ack($i$, 0)" during the next step at which it has received a
message from $i$.

Assume the lemma is true for $r - 1 \ge 1$. First observe that $j$ does not decide in any
phase $r' \le r - 3$: by Lemma 3.3, this would imply that no process decides in phase $r' + 1$ and
therefore no process sends an $r' + 2$ message, but this is not possible because $i$ advances to
phase $r \ge r' + 3$ without deciding and therefore must receive an $r' + 2$ message. If $j$ decides in
phase $r'$ and $r' = r - 2$ or $r - 1$, then $j$ immediately advances to phase $r' + 2 \ge r$ after deciding
and sends "ack($i$, $r - 1$)" to $i$. Suppose that $j$ does not decide in any phase $r' \le r - 1$. Process
$j$ must advance from each phase $r' \le r - 1$ because it is never shut down, has $M_j^{r'} \ne \emptyset$, and
has $|A_j^{r'-1}| \ge n - f$: $j$ is never shut down by Lemma 3.2; $j$ has $M_j^{r'} \ne \emptyset$ because it receives
an $r'$ message from $i$; $j$ has $|A_j^{r'-1}| \ge n - f$ because it is nonfaulty and therefore sends an
$r''$ message to all processes for each $r'' \le r' - 1$ and by the induction hypothesis receives
"ack($j$, $r''$)" from all nonfaulty processes, none of which, by Lemma 3.2, are ever added to
$F_j$. Process $j$ therefore advances to phase $r$ and may then send "ack($i$, $r - 1$)" to $i$. ■

Corollary 3.6 If process $i$ is nonfaulty and advances to phase $r \ge 1$ without deciding, then
it eventually has $|A_i^{r-1}| \ge n - f$. (A nonfaulty process is never blocked.)

Proof: Because $i$ is nonfaulty and advances to phase $r$ without deciding, for $0 \le r' < r$ it
sends an $r'$ message to all processes as it advances to phase $r' + 1$. By Lemma 3.5, $i$ receives
"ack($i$, $r - 1$)" from each nonfaulty process. Because by Lemma 3.2, nonfaulty processes are
never added to $F_i$, each nonfaulty process is added to $A_i^{r-1}$; since there are at least $n - f$
nonfaulty processes, this gives the necessary bound. ■

The following lemma relies on the fact that a process continues to take steps, executing 
the algorithm after it decides; in particular, it continues to detect the failure of processes 
and, if necessary, send acknowledgments. 

Lemma 3.7 If a faulty process $j$ unsuccessfully broadcasts an $r$ message at time $t$ and is
subsequently blocked in phase $r + 1$, then all processes not shut down by time
$t + C(d + c_2) + 2(d + c_2) \approx t + Cd + 2d$ detect the failure of $j$ by that time.

Proof: By the definition of being blocked, $j$ advances to phase $r + 1$ but never has
$|A_j^r| \ge n - f$. Thus there is some nonfaulty process $i$ never added to $A_j^r$. By Lemma 3.5, $j$ omits
an $r'$ message to $i$ for some $0 \le r' \le r$. This omission occurs at or before time $t$. By
Lemma 3.1, $i$ detects this failure by time $t + C(d + c_2) + (d + c_2)$, broadcasting "shutdown
$j$" to all processes in the same step. By time $d + c_2$ later, all processes not yet shut down
have received this message and taken a step, adding $j$ to their failed sets. ■



3.4 Correctness proof 

We now prove that in all $f$-admissible executions, the algorithm terminates and correctly
satisfies the agreement and validity conditions. We first prove "progress": that processes
in fact advance to successive phases as expected. Given this progress lemma and a few
simple facts about quiet phases, the proofs of agreement, validity, and termination are easily
derivable. These proofs follow the same reasoning as the informal argument about the
synchronous algorithm outlined in Section 3.2.

Lemma 3.8 For each $r \ge 0$ and each process $i$ that is neither blocked nor shut down in any
phase $r' \le r$, process $i$ either decides in some phase $r' \le r$ or advances to phase $r + 1$.

Proof: For contradiction, let phase $r$ be the first phase for which the lemma is not satisfied
and let $i$ be any process for which the lemma is not satisfied at phase $r$. By the choice of $r$,
$i$ advances to phase $r$.

First note that $r \ne 0$, since every process either decides or advances to phase 1 during
its first step.

We show below that for $r > 0$ and for every process $j$, $i$ either receives an $r - 1$
message from $j$ or adds $j$ to $F_i$ or $D_i$. We thus derive a contradiction by concluding that $i$
may either decide or advance to phase $r + 1$, since it has $j \in M_i^{r-1}$ for all $j \notin (D_i \cup F_i)$ and
by assumption is not shut down and eventually has $|A_i^{r-1}| \ge n - f$ (is not blocked in phase
$r$).

Let $j$ be any other process. First consider the case that $j$ also is neither shut down nor
blocked in any phase $r' < r$ and that further, $j$ does not fail directly to $i$. By the choice of
$r$, $j$ either advances to phase $r$ or decides in a previous phase. If $j$ advances to phase $r$, then
it must send an $r - 1$ message to $i$ (successfully, in this case). It cannot be that process $j$
decides in phase $r - 1$, since that would imply sending an $r$ message to $i$, thus enabling $i$
to advance immediately to phase $r + 1$, contradicting our original assumption. If $j$ decides
before phase $r - 1$, then it sends an "I've decided" message to $i$ and is added to $D_i$.

Now consider the case that j is either shut down or blocked in some phase r' < r or j 
fails directly to i. If j is blocked, then by Lemma 3.7, i will eventually detect that j is faulty. 
Similarly, if j is shut down, then it halts and i will detect its failure by timeout. Lastly, 
Lemma 3.1 ensures that if j fails directly to i and i is not shut down, then i eventually 
detects j as faulty and adds it to Fi. ■ 

Corollary 3.9 For any $r \ge 0$, every nonfaulty process either decides in phase $r' \le r$ or
advances to phase $r + 1$.

Proof: By Lemma 3.2 and Corollary 3.6, a nonfaulty process is never shut down or blocked; the
corollary then follows immediately from Lemma 3.8. ■

Corollary 3.10 If phase $r \ge 0$ is quiet, then each nonfaulty process decides in some phase
$r' \le r$.

Proof: By Corollary 3.9, each nonfaulty process either decides in phase $r' \le r$ or advances
to phase $r + 1$. But a nonfaulty process cannot advance to phase $r + 1$: to do so, it would
send an $r$ message to all processes, contradicting the assumption that phase $r$ is quiet. ■

Lemma 3.11 (Agreement) No two nonfaulty processes decide on different values.

Proof: Let $r$ be the first phase in which some nonfaulty process $i$ decides. By Lemma 3.3,
no process decides in phase $r + 1$. Because no process decides in phase $r + 1$, no process
sends an $r + 2$ message and thus phase $r + 2$ is quiet. Thus by Corollary 3.10, all nonfaulty
processes decide in some phase $r' \le r + 2$. By the choice of $r$, all nonfaulty processes decide
in either phase $r$ or phase $r + 2$, in either case on $r \bmod 2$. ■

Lemma 3.12 (Validity) If any process decides on value $b$, then some process $i$ starts with
$v_i = b$.

Proof: Clearly if some process $j$ decides on 1, it does so in phase $r > 0$ and that process
itself must have started with $v_j = 1$ since otherwise it would have decided on 0 during its
first step.

If some process $j$ decides on 0, it cannot be that all processes started with $v_i = 1$.
For then, no process would decide in phase 0 and no process would send a 1 message. No
process would receive a 1 message and therefore no process would advance to phase 2 without
deciding and so no process would decide 0. ■

Lemma 3.13 (Termination) In any $f$-admissible execution, there is a quiet phase numbered
at most $f + 2$ and so each nonfaulty process decides in some phase $r \le f + 2$.

Proof: If some nonfaulty process decides in phase $r \le f$ then no process decides in phase
$r + 1$ and no process sends an $r + 2$ message. Phase $r + 2$ is therefore quiet and by Corollary 3.10
all nonfaulty processes decide by phase $r + 2 \le f + 2$.

If no nonfaulty process decides in any phase $r \le f$, then there must be a phase $h$,
$0 \le h \le f$, in which no faulty process decides, and therefore in which no process decides.
If a process does not decide in phase $h$, then it does not send an $h + 1$ message until it
receives one. Therefore no process sends an $h + 1$ message, so phase $h + 1$ is quiet, and by
Corollary 3.10, all nonfaulty processes decide by phase $h + 1 \le f + 1$. ■

3.5 Analysis of time bounds

We now bound the amount of real time until all nonfaulty processes decide in any
$f$-admissible execution. The analysis in this section is carried out with respect to any given
$f$-admissible execution. Having already proved the correctness of the algorithm, we will
hereafter assume $d \gg c_2$ and make approximations appropriately. We first establish the tools for
our analysis and then conclude with the nearly optimal bound for $n \ge 2f + 1$ (Section 3.5.1)
and two bounds for $n \le 2f$ (Section 3.5.2). We first introduce some notation.

• For $r \ge 0$, let $t_r$ be the earliest time by which all processes not blocked in any phase
$r' \le r$ of the execution have either decided, advanced to phase $r + 1$, or been shut
down.

Because every process either decides or advances to phase 1 on its first step, $t_0 = 0$.

• Let phase $h$ be the first (smallest numbered) phase that is quiet.

• For $r \ge 0$, let $B_r = \{i : i \text{ is blocked in phase } r + 1\}$; let $b_r = |B_r|$.

The definition of $B_r$ may seem unusual, but makes sense on closer analysis. We will want
to bound $t_r - t_{r-1}$, which we think of as the time for phase $r$, in terms of the number of
processes that omit an $r$ message to all nonfaulty processes. This number is $b_r$ since all such
processes are subsequently blocked in phase $r + 1$.

Lemma 3.14 For $r \ne r'$, $B_r \cap B_{r'} = \emptyset$.

Proof: By definition, a process must advance to phase $r'$ in order to be blocked in phase
$r'$. If $r < r'$ and $i \in B_r$, then $i$ is blocked in phase $r + 1 \le r'$ and cannot advance to phase
$r + 2 \le r' + 1$ or greater. Therefore, $i$ is not blocked in phase $r' + 1$ and cannot be in $B_{r'}$. ■

Corollary 3.15 $\sum_{r=0}^{\infty} b_r \le f$.

Proof: By Corollary 3.6, a nonfaulty process is not in any $B_r$, so these sets consist of faulty
processes only. The bound of $f$ then follows immediately from the disjointness of the sets
$B_r$, from Lemma 3.14. ■

We prove our upper bound by summing the times of the individual phases. We will say
"the time of/for phase $r$" to mean $t_r - t_{r-1}$. We prove an upper bound for two kinds of phases:
those that are quiet and those that are not. We first derive some useful lemmas about the
receipt of acknowledgments. We then prove an upper bound on the time to complete any
phase, in particular quiet phases. We then prove a lemma (Lemma 3.19) that is at the
heart of the timing analysis, regarding causal chains of $r$ messages. The time for phases that
are not quiet depends on whether or not $n \ge 2f + 1$ and will be deferred until the following
subsections (Sections 3.5.1 and 3.5.2), where we will also sum over the phases to derive the
total time bounds.

We first prove a useful lemma about the timeliness of acknowledgments: if a process
receives a sufficient number of properly sequenced acknowledgments for its $r$ message, then
it receives them promptly, by time $t_r + 2d$.

Lemma 3.16 For $r \ge 0$, if process $j$ eventually has $|A_j^r| \ge n - f$, then it has $|A_j^r| \ge n - f$
by time $t_r + 2d$.

Proof: Process $j$ sends an $r$ message either as it advances to phase $r + 1$ or as it decides in
phase $r - 1$. If process $j$ broadcasts its $r$ message because it advances to phase $r + 1$, then
it is clearly not blocked in any phase $r' \le r$ and is neither decided nor shut down before
it broadcasts this message, and so broadcasts it by time $t_r$. Similarly, if $j$ broadcasts its
$r$ message because it decides in phase $r - 1$, then it does so by time $t_{r-1}$. In either case,
$j$ broadcasts its $r$ message by time $t_r$ and any process that receives an $r$ message from $j$
receives it by time $t_r + d$.

Consider any process $i \in A_j^r$. We claim that $i$ sends "ack($j$, $r$)" by time $t_r + d$. By the
fact that it sends "ack($j$, $r$)" eventually, process $i$ must advance to phase $r + 1$ or greater
(either by deciding in phase $r - 1$ or phase $r$ or by advancing to phase $r + 1$ without deciding)
before sending "ack($j$, $r$)". It follows that $i$ is neither blocked in any phase $r' \le r$ nor shut
down before it does so and therefore advances to phase $r + 1$ by time $t_r$. By time $t_r + d$, $i$
also receives an $r$ message from $j$ and therefore sends "ack($j$, $r$)" by then. Process $j$ receives
this acknowledgment within a further time $d$, that is, by time $t_r + 2d$. ■

Corollary 3.17 For $r \ge 0$, if process $i$ sends an $r + 1$ message after time $t_r$ or for some $j$
process $i$ sends "ack($j$, $r$)", then $i$ has $|A_i^r| \ge n - f$ by time $t_r + 2d$.

Proof: If process $i$ sends an $r + 1$ message after time $t_r$ then it does not send the $r + 1$
message as a result of deciding in phase $r$, since processes that decide in phase $r$ do so by
time $t_r$. Therefore $i$ sends an $r + 1$ message as a result of advancing from phase $r + 1$, which
requires that it have $|A_i^r| \ge n - f$. By Lemma 3.16, $i$ therefore has $|A_i^r| \ge n - f$ by time
$t_r + 2d$. ■

We now prove a generous upper bound on the time to complete any phase (in particular, 
quiet phases). The proof is very similar to the proof of progress. 

Lemma 3.18 $t_1 - t_0 \le Cd + d$ and for any phase $r > 1$,
$t_r \le \max(t_{r-1} + Cd + d,\; t_{r-2} + Cd + 2d)$.

Proof: For contradiction, assume that for $r > 1$ (respectively, $r = 1$) at time
$\max(t_{r-1} + Cd + d,\; t_{r-2} + Cd + 2d)$ (resp. time $t_0 + Cd + d$), some process $i$ has neither decided nor
advanced to phase $r + 1$ nor been shut down and is never blocked in any phase $r' \le r$. By
this time, by the definition of $t_{r-1}$, $i$ is in phase $r$, and since it is not blocked in phase $r$, has
$|A_i^{r-1}| \ge n - f$ by Lemma 3.16. We will reach a contradiction by showing that $i$ must decide
in phase $r$ by this time because for every other process $j$, either $i$ receives an $r - 1$ message
from $j$ or $i$ detects that $j$ has decided or failed ($j \in D_i \cup F_i$).

Let $j$ be any other process. First consider the case that $j$ (1) is not blocked in any phase
$r' \le r - 1$, (2) is not shut down by time $t_{r-1}$, and (3) does not fail directly to $i$ before or at
time $t_{r-1}$. By Lemma 3.8, $j$ either advances to phase $r$ or decides in some phase $r' \le r - 1$;
by definition it does so by time $t_{r-1}$. If $j$ advances to phase $r$, then it sends (successfully, by
assumption) an $r - 1$ message to $i$ by time $t_{r-1}$ and $i$ receives this message by time $t_{r-1} + d$.
If $j$ decides in phase $r' \le r - 1$, then by time $t_{r'} + d \le t_{r-1} + d$, $i$ receives an "I've decided"
message from $j$ and adds $j$ to $D_i$.

Now consider the case that $j$ either (1) is blocked in some phase $r' \le r - 1$, (2) is shut
down by time $t_{r-1}$, or (3) fails directly to $i$ at or before time $t_{r-1}$. If $j$ is shut down or
fails directly to $i$ at or before time $t_{r-1}$, then by Lemma 3.1, $i$ detects the failure by time
$t_{r-1} + Cd + d$. Case (1) is not possible for $r = 1$, so we are finished for that case. If $j$ is
blocked in some phase $r' \le r - 1$, then because it advances to phase $r'$, $j$ neither decides
nor is blocked nor shut down in any prior phase. Therefore, by time $t_{r'-1} \le t_{r-2}$, $j$ advances
to phase $r'$, broadcasting (unsuccessfully) an $r' - 1$ message. By Lemma 3.7, all processes,
including $i$, detect the failure of $j$ by time $t_{r-2} + Cd + 2d$. ■
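
To see what the recurrence of Lemma 3.18 yields when applied to every phase, the following Python sketch unrolls it from $t_0 = 0$. The function name and list representation are assumptions of this illustration; the sketch computes upper bounds only and says nothing about the times of an actual execution.

```python
def phase_time_bounds(num_phases, C, d):
    """Upper bounds on t_0, ..., t_num_phases implied by Lemma 3.18 alone."""
    t = [0]                                   # t_0 = 0
    if num_phases >= 1:
        t.append(t[0] + C * d + d)            # t_1 <= t_0 + Cd + d
    for r in range(2, num_phases + 1):
        # t_r <= max(t_{r-1} + Cd + d, t_{r-2} + Cd + 2d)
        t.append(max(t[r - 1] + C * d + d, t[r - 2] + C * d + 2 * d))
    return t
```

Unrolled this way, each phase costs about $Cd + d$, so the bound grows with $C$ in every phase. This is why the lemma is described as generous and why a sharper bound is needed for phases that are not quiet.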

In bounding the time of a phase $r$ that is not quiet, we will bound the time until every
process receives an $r$ message (which every process does, by the definition of a quiet phase).
By that time, every process that is not yet decided or shut down or blocked in any phase
$r' \le r$ may advance to phase $r + 1$; thus this is a bound for $t_r$. In bounding the time until
every process receives an $r$ message, the following reasoning is at the heart of the analysis.
In order for the first $r$ message to ever be sent, some process must decide in phase $r - 1$,
which by definition, it does by time $t_{r-1}$. An $r$ message sent by any other process $i$ that does
not decide in phase $r - 1$ is sent because $i$ received an $r$ message. Thus, a causal chain of
$r$ messages may be followed and the first $r$ message received by any process can be traced
back to a process that originated it ($i_k$ in the following lemma), sending the "first" $r$ message
before $t_{r-1}$. Because a process broadcasts an $r$ message as soon as it receives one (at its
next step, to be precise; also, assuming it has $|A^{r-1}| \ge n - f$, which it does from time
$t_{r-1} + 2d$ on, if at all), our time bound for phases that are not quiet is approximately $d$ times
the length of the shortest such chain to each process. We now prove a lemma about the existence
of such chains and their basic timing properties. This lemma is central to every bound we will
prove for the omission failures algorithm.

Lemma 3.19 If phase $r$ is not quiet, then for every process $i_0$, there exists a sequence of
distinct processes $i_0, i_1, \ldots, i_k$ and messages $m_0, m_1, \ldots, m_k$ with $k \ge 0$ such that

(1) for $0 < j \le k$, $i_j$ sends the first $r$ message, $m_{j-1}$, received by $i_{j-1}$,

(2) exactly one process, $i_k$, sends an $r$ message by time $t_{r-1}$, and

(3) for $0 < j \le k$, process $i_j$ sends an $r$ message ($m_{j-1}$) by time $t_{r-1} + (k - j + 1)d$.

Proof: Phase $r$ is not quiet, so every process $i_j$ receives an $r$ message; let $m_j$ be the first
$r$ message that $i_j$ receives. Define a sequence of processes $i_0, i_1, \ldots$ inductively as follows: if
$i_j$ sends an $r$ message by time $t_{r-1}$ then define $k = j$ and let $i_j$ be the last process of the
sequence; otherwise, define $i_{j+1}$ to be the process that sends $m_j$.

We first claim that the resulting sequence does not include repetitions and is therefore finite.
This is clear if $i_0$ sends an $r$ message by time $t_{r-1}$ (then $k = j = 0$). If not, we show that for
any $0 < j \le k$, process $i_j$ is distinct from processes $i_0, \ldots, i_{j-1}$. Only $i_0$ may fail to send
an $r$ message. If it does, then clearly it is distinct from the other processes in the sequence;
if not, then let $m_{-1}$ be any $r$ message that it sends. If $i_j$ sends an $r$ message by time $t_{r-1}$,
then clearly it is distinct and we are done. If not, then for all $i_x$, $0 \le x \le j$, because $i_x$
sends an $r$ message ($m_{x-1}$) later than time $t_{r-1}$, $i_x$ must send it as the result of receiving
an $r$ message (by the definition of $t_{r-1}$, a process that decides in phase $r - 1$ broadcasts $r$
by time $t_{r-1}$). It follows that the sending of $m_{x-1}$ by $i_x$ is preceded by the sending of $m_x$,
the first $r$ message received by $i_x$. Because a process broadcasts an $r$ message only once, it
follows that processes $i_0, \ldots, i_j$ are distinct.

Thus the sequence $i_k, \ldots, i_1, i_0$ forms a chain of processes such that for $0 < j \le k$, process
$i_j$ sends the first $r$ message, $m_{j-1}$, received by $i_{j-1}$ and $i_k$ is the only process in the sequence
to broadcast an $r$ message before time $t_{r-1}$. This proves (1) and (2).

It remains to show (3), the timing property. For $0 < j < k$, the fact that $i_j$ sends an $r$
message but does not decide in phase $r - 1$ implies that $i_j$ advances to phase $r$ by time $t_{r-1}$,
since it is not blocked in any phase $r' < r$ and is not decided or shut down before sending
$m_{j-1}$, which it does after time $t_{r-1}$. Therefore, by time $t_{r-1} + 2d$, each $i_j$ is in phase $r$ and
by Lemma 3.16, has $|A_{i_j}^{r-1}| \ge n - f$. Since $i_{k-1}$ receives $m_{k-1}$ by time $t_{r-1} + d$, it advances
to phase $r + 1$, sending $m_{k-2}$ by time $t_{r-1} + 2d$. Process $i_{k-2}$ receives this message by time
$t_{r-1} + 3d$ and thus advances to phase $r + 1$, sending $m_{k-3}$ by time $t_{r-1} + 3d$. Similarly, for
$0 < j < k$, process $i_j$ receives $m_j$ and sends $m_{j-1}$ by time $t_{r-1} + (1 + k - j)d$. ■
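
Property (3) can be tabulated for a chain of a given length. In the Python sketch below, the function names and the parameter `t_prev` (standing for $t_{r-1}$) are assumptions of this illustration; the receipt deadline for $i_0$ simply adds one message delay $d$ to the deadline by which $i_1$ sends $m_0$.

```python
def chain_send_deadlines(t_prev, k, d):
    """Deadline by which i_j sends m_{j-1}, for j = k, ..., 1 (property (3))."""
    return {j: t_prev + (k - j + 1) * d for j in range(k, 0, -1)}


def receipt_deadline(t_prev, k, d):
    """Deadline by which i_0 receives m_0: one delay d after i_1's send deadline."""
    return t_prev + (k + 1) * d
```

Each additional link in the chain adds one message delay $d$ and no factor of $C$, which is what keeps the non-quiet phases inexpensive.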

To complete the lemmas necessary to tightly bound the running time, we need only bound
the time for any phase that is not quiet. This bound depends on whether or not $n \ge 2f + 1$.
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3.5.1 Bound for n > 2/ + 1 

We show that the algorithm depends on $C$ only to the extent of an additive term of $Cd$.
For $C$ large, this algorithm may be far more efficient than a direct rounds simulation. The
bound we obtain for $n \ge 2f + 1$ is within approximately a factor of 4 of optimal: our bound
is $4(f + 1)d + Cd$; the lower bound proved in [ADLS90] is $(f - 1)d + Cd$.
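
To make the comparison concrete, the following minimal sketch evaluates the upper bound, the cost of a direct round-by-round simulation, $(f + 1)Cd$, and the [ADLS90] lower bound. The function names and the sample values of $f$, $d$ and $C$ are illustrative assumptions and do not come from the analysis itself.

```python
def omission_upper_bound(f: int, d: float, C: float) -> float:
    """Upper bound 4(f+1)d + Cd quoted for n >= 2f+1."""
    return 4 * (f + 1) * d + C * d


def direct_simulation_bound(f: int, d: float, C: float) -> float:
    """Cost (f+1)Cd of simulating f+1 synchronous rounds one by one."""
    return (f + 1) * C * d


def adls_lower_bound(f: int, d: float, C: float) -> float:
    """Lower bound (f-1)d + Cd attributed to [ADLS90]."""
    return (f - 1) * d + C * d


if __name__ == "__main__":
    f, d, C = 3, 1.0, 10.0  # hypothetical sample values
    print(omission_upper_bound(f, d, C))     # 26.0
    print(direct_simulation_bound(f, d, C))  # 40.0
    print(adls_lower_bound(f, d, C))         # 12.0
```

With these sample values the upper bound is well below the direct simulation and within a factor of 4 of the lower bound; the gap to the direct simulation widens as $C$ grows.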

Having bounded the time for quiet phases in Lemma 3.18, we need only bound the time
for any phase that is not quiet. If $n \ge 2f + 1$, we can be sure that when a faulty process
broadcasts an $r$ message, it either sends to at least one nonfaulty process or becomes blocked
in phase $r + 1$, since $f < n - f$. If it sends to a nonfaulty process, then that process will send an
$r$ message to all processes and the phase will end. The number of processes blocked in phase
$r + 1$ is exactly $b_r$; our bound for phase $r$ is roughly $b_r \cdot d$. This is the key difference between
our algorithm and the algorithm of [ADLS90]: a faulty process may cause delay $d$ only if
it sends exclusively to other faulty processes; the convention of requiring acknowledgments
ensures that each faulty process can do so only once.

To reinforce the intuition about this bound, we first describe how this bound is realized
by a worst-case execution: Process $1 \in B_r$ is the first to send an $r$ message. It decides in
phase $r - 1$ at time $t_{r-1}$ (no later, by definition of $t_{r-1}$, since process 1 is not blocked in
any phase $r' < r - 1$) and sends an $r$ message to only process $2 \in B_r$. Process 2 waits until
time $t_{r-1} + 2d$ for $|A_2^{r-1}| \ge n - f$ and then, having received an $r$ message from 1, advances
to phase $r + 1$, sending an $r$ message to only process $3 \in B_r$. The pattern is repeated until
process $b_r + 1 \notin B_r$ receives an $r$ message at time $t_{r-1} + (b_r + 1)d$. Process $b_r + 1$ advances to
phase $r + 1$ and omits an $r$ message to exactly one nonfaulty process, $i$. All nonfaulty processes
except $i$ receive an $r$ message from $b_r + 1$ at time $t_{r-1} + (b_r + 2)d$ and $i$ receives an $r$ message
from them at time $t_{r-1} + (b_r + 3)d$. By this time, each process has either advanced to phase
$r + 1$ (as it sent an $r$ message), decided, been shut down, or is blocked in some phase $r' < r$.
This scenario shows where the extra $3d$ arises: one $d$ is caused by the delay of waiting (by
process 2 in this scenario) for acknowledgments from the previous phase, another $d$ is for a
faulty process (here, $b_r + 1$) that is not blocked in phase $r + 1$ to send an $r$ message to a
nonfaulty process, and another $d$ is for the remaining nonfaulty processes (here, $i$) to receive
an $r$ message. (In [ADLS90], only the last extra $d$ is incurred; this leads to the factor of 2
in their bound, instead of 4 in ours.)
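
The worst-case scenario described above can be laid out as a timeline. The sketch below is only an illustration of that scenario: the function name and the event wording are assumptions, and times are reported as offsets from $t_{r-1}$.

```python
def worst_case_phase_timeline(b_r: int, d: float):
    """Return (offset from t_{r-1}, event) pairs for the scenario in the text."""
    if b_r < 1:
        raise ValueError("the scenario needs at least one process in B_r")
    events = [(0.0, "process 1 decides in phase r-1, sends r only to process 2")]
    # Process 2 is held up until 2d waiting for acknowledgments; each later
    # process in the chain receives the r message one delay d after that.
    for j in range(2, b_r + 2):
        events.append((j * d, f"process {j} advances to phase r+1 and sends r"))
    events.append(((b_r + 2) * d, "all nonfaulty processes except i receive r"))
    events.append(((b_r + 3) * d, "process i receives r; phase r can end"))
    return events


if __name__ == "__main__":
    for offset, event in worst_case_phase_timeline(b_r=2, d=1.0):
        print(f"t_(r-1) + {offset}: {event}")
```

The final offset is $(b_r + 3)d$, which is the phase length the scenario is meant to realize.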

Lemma 3.20 For $n \ge 2f + 1$ and $r \ge 1$, if for all $r' < r$ phase $r'$ is not quiet, then
$t_r - t_{r-1} \le (3 + b_r)d$.

Proof: We show that by time $t_{r-1} + (3 + b_r)d$, all processes receive an $r$ message. Thus, by
that time, every process is either decided, shut down, blocked in some phase $r' < r$, or may
advance to phase $r + 1$.

By Lemma 3.19, we know that for every process $i_0$, there is a sequence of distinct processes,
$i_0, i_1, \ldots, i_k$, satisfying the three properties of Lemma 3.19.

Now, if $k \le b_r + 2$ then $i_0$ receives $m_0$ by time $t_{r-1} + (k - 1 + 1)d + d \le t_{r-1} + (b_r + 2)d + d$.
If $k > b_r + 2$, then there is a $j$ such that $k - b_r \le j \le k$ and $i_j \notin B_r$. By Lemma 3.19, $i_j$
sends an $r$ message by time $t_{r-1} + (k - j + 1)d \le t_{r-1} + (1 + b_r)d$. Because $i_j \notin B_r$, $i_j$ sends
an $r$ message to at least $n - f \ge f + 1$ processes, one of which must be nonfaulty. This
nonfaulty process, $\ell$, receives an $r$ message from $i_j$ by time $t_{r-1} + (2 + b_r)d$.

We now conclude the proof by showing that process $\ell$ sends an $r$ message, received by
all processes, by time $t_{r-1} + (2 + b_r)d$. Because no phase $r' < r$ is quiet, it follows from
Lemma 3.4 that $\ell$ does not decide in any phase $r' \le r - 2$. If $\ell$ decides in phase $r - 1$, then
it does so, sending an $r$ message, by time $t_{r-1}$. If $\ell$ decides in phase $r$ or advances to phase $r$
without deciding, then it does so by time $t_{r-1}$ and subsequently sends an $r$ message once it
receives one and has $|A_\ell^{r-1}| \ge n - f$, which, by Lemma 3.16, it does by time $t_{r-1} + (2 + b_r)d$. ■

We can now bound tightly the running time of any $f$-admissible execution by summing
the bounds for all phases in that execution.

Theorem 3.21 For $n \ge 2f + 1$, the algorithm above solves the consensus problem for $f$
omission failures within time $4(f + 1)d + Cd$.

Proof: For any given execution, let $h$ be the first quiet phase. By Lemma 3.10, each
nonfaulty process decides in some phase $r \le h$, by time $t_h$. If $h = 0$ then by Lemma 3.10,
each nonfaulty process decides in phase 0 in its first step and the running time is 0. If $h = 1$
then by Lemma 3.10, each nonfaulty process decides in phase 1 or 0, by time $t_h$, and by
Lemma 3.18 the running time is $t_1 - t_0 \le Cd + d$.

If $h > 1$, then we can bound the time for phases $1, \ldots, h - 1$ by Lemma 3.20, and the
time for phase $h$ by Lemma 3.18. Thus we have

$$
\begin{aligned}
t_h - t_0 &= \sum_{r=1}^{h-1} (t_r - t_{r-1}) + (t_h - t_{h-1}) \\
&\le \sum_{r=1}^{h-1} (3 + b_r)d + (Cd + d) && \text{(by Lemmas 3.20 and 3.18)} \\
&\le (f + 1)3d + f \cdot d + (Cd + d) && \text{(by Lemma 3.13 and Cor. 3.15)} \\
&= 4(f + 1)d + Cd.
\end{aligned}
$$



For $C \ge 4$, it is possible to construct an execution that takes exactly time
$3d + 4(f - 3)d + 3d + Cd + d$. In this execution, the first phase takes time $3d$, the following
$f - 3$ phases take time $4d$, the penultimate phase takes $3d$, and the last phase takes time $Cd + d$.
Each of the phases taking $4d$ develops when all processes receive an $r - 1$ message at time
$t_{r-1}$ and all but one, $p_{r-1}$, advance to phase $r$. Process $p_{r-1}$ decides on $(r - 1) \bmod 2$ at $t_{r-1}$
(before it receives the $r - 1$ message) and sends an $r$ message to exactly one other process,
$p_{r+1}$, which receives its acknowledgments for its $r - 1$ message at time $t_{r-1} + 2d$ and sends an
$r$ message to exactly $n - f$ processes. By time $t_{r-1} + 4d$, all processes receive an $r$ message
and, except for one process, $p_r$, advance to phase $r + 1$. In the following phase, at time
$t_r + 4d$, process $p_{r+1}$ decides (the processes to which $p_{r+1}$ omitted an $r$ message run slowly
and do not detect its failure until $t_{r-1} + 2d + (Cd + d) = t_{r-1} + 7d = t_r + 3d$, so it is not
shut down before then). Remaining details are left to the reader.

3.5.2 Bounds for $n \le 2f$

When $n \le 2f$, we are able to bound the running time of the algorithm in two ways, yielding
one expression that depends on the ratio $\frac{f}{n-f}$ and another expression that depends on the
square root of $C$. We will use Lemmas 3.14 ($B_r \cap B_{r'} = \emptyset$), 3.16 (the timeliness of
acknowledgments), 3.18 (the time for any phase), and 3.19 (sequences of causal $r$ messages), and
Corollaries 3.15 (the sum of the $b_r$) and 3.17 (also regarding acknowledgments), the proofs
of which did not rely on the relative values of $n$ and $f$.

Bound dependent on $\frac{f}{n-f}$

This bound requires a lemma about the length of causal sequences of $r$ messages more
complicated than Lemma 3.19. Processes not in $B_r$ must send an $r$ message to $n - f$
processes but not necessarily to a nonfaulty process. We therefore are not able to argue, as
for $n \ge 2f + 1$, that phase $r$ ends very soon after a process not in $B_r$ sends an $r$ message.
Nevertheless, disregarding processes in $B_r$ for the moment, if it were true that a process
could not get an acknowledgment from another process that already sent an $r$ message, then
it would take at most time $(\frac{f}{n-f})d$ before a nonfaulty process received an $r$ message. Our
algorithm does not exactly enforce this restriction on acknowledgments, but it does prevent
a process from using acknowledgments received from a process that previously omitted an $r$
message to it. We are thus able to derive a bound of $(3\frac{f}{n-f} + b_r + 4)d$ below in Lemma 3.23.
This argument is most easily made by considering a directed graph on the faulty processors.
Accordingly, for a given execution, define

• directed graph $G_f^r = (V_f, E_f^r)$ where

$V_f$ = {all processes that fail during the given execution},
$E_f^r = \{(i, j) : i \text{ sends an } r \text{ message to } j;\ i, j \in V_f;\ i \ne j\}$.

• $\delta_f(i, j)$ = length of the shortest path in $G_f^r$ from $i$ to $j$, where $i, j \in V_f$.

• $S_r^{t_{r-1}} = \{i : i \text{ sends an } r \text{ message by time } t_{r-1}\}$.

• $S_r^{nf} = \{i : i \text{ sends an } r \text{ message to a nonfaulty process}\}$.

Claim 3.22 If phase $r$ is not quiet and no nonfaulty process decides in phase $r - 1$, then
there exist faulty processes $\alpha \in S_r^{t_{r-1}}$ and $\gamma \in S_r^{nf}$ such that there is a path in $G_f^r$ from $\alpha$
to $\gamma$.

Proof: Let $\gamma$ be the first process to send an $r$ message to a nonfaulty process. Process $\gamma$
must be faulty: by the choice of $\gamma$, no process sends an $r$ message to a nonfaulty process
earlier than $\gamma$ sends an $r$ message and therefore if $\gamma$ is nonfaulty then no process sends
an $r$ message to $\gamma$ before it sends its $r$ message. Therefore $\gamma$ must decide in phase $r - 1$,
contradicting our assumption that no nonfaulty process decides in phase $r - 1$. We conclude
that $\gamma \in S_r^{nf}$. Note that $\gamma$ sends an $r$ message before any nonfaulty process does.

Let $C_f$ be the nodes in $G_f^r$ from which node $\gamma$ is reachable (including $\gamma$) and let $\alpha$ be
the process such that no process in $C_f$ sends an $r$ message before $\alpha$ does. It follows that $\alpha$
sends an $r$ message no later than $\gamma$ does. Because no nonfaulty process sends an $r$ message
before $\gamma$ does, $\alpha$ does not receive an $r$ message from a nonfaulty process before sending its
$r$ message. By choice, $\alpha$ does not receive an $r$ message from any faulty process before it
sends its $r$ message. We therefore conclude that $\alpha$ receives no $r$ messages before sending its
own, and therefore must decide in phase $r - 1$, sending an $r$ message by time $t_{r-1}$ (by its
definition). ■



Lemma 3.23 For $r \ge 1$, if for all $r' < r$ phase $r'$ is not quiet, then

$$t_r - t_{r-1} \le \left(3\frac{f}{n-f} + b_r + 4\right)d.$$



Proof: We show that by time $t_{r-1} + (3\frac{f}{n-f} + b_r + 4)d$, every process receives an $r$ message.
Thus, by that time, every process that is never blocked in any phase $r' < r$ and is neither
decided nor shut down at that time, has $|A^{r-1}| \ge n - f$ by Lemma 3.16 (because it is not
blocked in phase $r$) and therefore may advance to phase $r + 1$.

First note that if a nonfaulty process decides in phase $r - 1$, it does so by time $t_{r-1}$,
broadcasting an $r$ message that is subsequently received by all processes by time $t_{r-1} + d$,
and the lemma is proved. So we consider the case that no nonfaulty process decides in
phase $r - 1$.

Claim 3.22 applies for this case: it says that there exist processes $i \in S_r^{t_{r-1}} \cap V_f$ and
$j \in S_r^{nf} \cap V_f$ such that $j$ is reachable from $i$ in $G_f^r$. Let $\alpha \in S_r^{t_{r-1}}$ and $\gamma \in S_r^{nf}$ be a closest
pair of nodes in $G_f^r$:

$$\delta_f(\alpha, \gamma) = \min_{i \in S_r^{t_{r-1}} \cap V_f,\; j \in S_r^{nf} \cap V_f} \delta_f(i, j).$$

We first bound the time by which $\gamma$ broadcasts its $r$ message. Fix a minimal length path
from $\alpha$ to $\gamma$ and let $\beta$ be the last process on that path that sends an $r$ message by time
$t_{r-1} + 2d$. We claim that $\gamma$ broadcasts its $r$ message by time $t_{r-1} + (\delta_f(\beta, \gamma) + 2)d$. By
the choice of $\beta$, each process $j$ on the path from $\beta$ to $\gamma$ sends an $r$ message later than time
$t_{r-1} + 2d$ and therefore by Corollary 3.17 has $|A_j^{r-1}| \ge n - f$ by time $t_{r-1} + 2d$. By time
$t_{r-1} + 3d$, the process on this path after $\beta$ receives an $r$ message (from $\beta$) and thus broadcasts
an $r$ message. Similarly, for $j \ge 1$, the $j$th process on the path after $\beta$ sends an $r$ message
by time $t_{r-1} + 2d + jd$ and process $\gamma$ sends an $r$ message by time $t_{r-1} + (2 + \delta_f(\beta, \gamma))d$.

We now show that by time $t_{r-1} + (\delta_f(\beta, \gamma) + 4)d$, all processes receive an $r$ message. A
nonfaulty process $\eta$ receives an $r$ message from $\gamma$ by $t_{r-1} + (3 + \delta_f(\beta, \gamma))d$. Because no phase
$r' < r$ is quiet, it follows from Lemma 3.4 that $\eta$ does not decide in any phase $r' \le r - 2$. If
$\eta$ decides in phase $r - 1$, then it does so by time $t_{r-1}$, sending an $r$ message as it does; all
processes receive it by time $t_{r-1} + d$. If $\eta$ advances to phase $r$ without deciding, then it does so
by time $t_{r-1}$. By Corollaries 3.6 and 3.17, $\eta$ has $|A_\eta^{r-1}| \ge n - f$ by time $t_{r-1} + (3 + \delta_f(\beta, \gamma))d$.
By this time, $\eta$ has received an $r$ message from $\gamma$ and therefore if $\eta$ has not yet sent an $r$
message (that is, if $\eta$ has not yet advanced from phase $r$, or has decided in phase $r$ and advanced
to phase $r + 2$ but not yet sent an $r$ message) it may then send an $r$ message. An $r$ message
is then received by all processes by time $t_{r-1} + (\delta_f(\beta, \gamma) + 4)d$.

To complete the proof, we now show $\delta_f(\beta, \gamma) \le \frac{3f}{n-f} + b_r$. Let $k = \delta_f(\beta, \gamma)$ and let
$L_i = \{p : \delta_f(\beta, p) = i\}$ for $1 \le i \le k$. Define $L_0 = \{\beta\}$ and $L_{-1} = \emptyset$. Consider the sum

$$\sigma = \sum_{i=0}^{k-1} |L_{i-1} \cup L_i \cup L_{i+1}|.$$

Since the sets $L_i$ are disjoint, each node in $G_f^r$ is counted at most 3 times, so $\sigma \le 3|G_f^r| \le 3f$.

Claim 3.24 For $0 \le i \le k - 1$, at least $k - b_r$ of the sets $L_{i-1} \cup L_i \cup L_{i+1}$ have cardinality at
least $n - f$.

Proof: Clearly, for $i \le k - 1$, no set $L_i$ is empty, since $\gamma$, at distance $k$ from $\beta$, receives
an $r$ message from a faulty process. At least $k - b_r$ sets $L_i$ contain a process $j \notin B_r$ such
that $j$ is on the path from $\beta$ to $\gamma$. For each such $j$, and each process $\ell$ in $A_j^r$, clearly, $j$ sends
$\ell$ an $r$ message; we will show that if $j$ is on the chosen path from $\beta$ to $\gamma$, then process $\ell$
sends $j$ an $r$ message also. We will also show that if $j \in L_i$ where $i \le k - 1$, then $\ell$ is faulty
and therefore in $G_f^r$. Thus, for all $\ell$ in $A_j^r$ such that $j \in L_i$, there are edges of $G_f^r$ in both
directions between $j$ and $\ell$, and if $j \notin B_r$, then $|L_{i-1} \cup L_i \cup L_{i+1}| \ge n - f$, completing the
proof.

We first show that if $j \in L_i$ where $i \le k - 1$, then $\ell \in A_j^r$ is faulty. If $\ell$ were nonfaulty,
then $j$ would be in $S_r^{nf}$. But $j$ cannot be in $S_r^{nf}$, since $\gamma$, at distance $k$, was defined to be
the closest node to $\alpha$ in $S_r^{nf} \cap V_f$ but $j \in L_i$ is at distance $i < k$.

We next show that for each $j$ on the chosen path from $\beta$ to $\gamma$, if $\ell \in A_j^r$, then $\ell$ sends $j$ an
$r$ message. Consider first the case that $\ell$ broadcasts an $r$ message before sending "ack(j, r)"
to $j$. Since $\ell \in A_j^r$, $\ell$ does not omit a message to $j$ before sending "ack(j, r)"; in particular,
it does not omit the $r$ message. Consider then the case that $\ell$ does not send an $r$ message
before sending "ack(j, r)" to $j$. By the choice of $\beta$, $j$ sends its $r$ message (and $\ell$ receives it)
later than time $t_{r-1} + 2d$. Because $\ell$ sends an "ack(j, r)" message, $\ell$ either decides in phase
$r - 1$ or advances to phase $r$ without deciding. However, $\ell$ cannot decide in phase $r - 1$:
processes that decide in phase $r - 1$ do so, sending an $r$ message by time $t_{r-1}$, but we are
assuming $\ell$ does not send an $r$ message before sending "ack(j, r)", which is later than time
$t_{r-1} + 2d$. Thus, at the time that $\ell$ receives the $r$ message from $j$, $\ell$ is either in phase $r$,
not yet having sent an $r$ message, or decided and in phase $r + 2$, not yet having sent an $r$
message. In its first step after receiving the $r$ message from $j$, process $\ell$ queues both "r" and
"ack(j, r)". Because $\ell \in A_j^r$, this message is not omitted, so $\ell$ sends an $r$ message to $j$. ■

Thus we have $(k - b_r)(n - f) \le \sigma$ and $\sigma \le 3f$, or $\delta_f(\beta, \gamma) = k \le \frac{3f}{n-f} + b_r$, which
completes the proof: all processes receive an $r$ message by time $t_{r-1} + (\frac{3f}{n-f} + b_r + 4)d$. ■



Theorem 3.25 For $n \le 2f$, the algorithm above solves the consensus problem for $f$ omission
failures within time $(3\frac{f}{n-f} + 5)(f + 1)d + Cd$.



Proof: For any given execution of the algorithm in which $h$ is the first quiet phase, by
Lemma 3.10, each nonfaulty process decides in some phase $r \le h$, by time $t_h$. Again, if
$h = 0$ then by Lemma 3.10, each nonfaulty process decides in phase 0 in its first step and
the running time is 0. If $h = 1$ then by Lemma 3.10, each nonfaulty process decides in phase
1 or 0, by time $t_h$, and by Lemma 3.18 the running time is $t_1 - t_0 \le Cd + d$.

If $h > 1$, we can bound the time for phases $1, \ldots, h - 1$ by Lemma 3.23, and the time for
phase $h$ by Lemma 3.18. Thus we have

$$
\begin{aligned}
t_h - t_0 &= \sum_{r=1}^{h-1} (t_r - t_{r-1}) + (t_h - t_{h-1}) \\
&\le \sum_{r=1}^{h-1} \left(4 + 3\frac{f}{n-f} + b_r\right)d + (Cd + d) && \text{(by Lemmas 3.23 and 3.18)} \\
&\le (f + 1)\left(4 + 3\frac{f}{n-f}\right)d + f \cdot d + (Cd + d) && \text{(by Lemma 3.13 and Cor. 3.15)} \\
&= (f + 1)\left(5 + 3\frac{f}{n-f}\right)d + Cd.
\end{aligned}
$$

Bound dependent on $\sqrt{C}$

The analysis of the previous section shows that the running time of our algorithm is
bounded by

$$\left(3\frac{f}{n-f} + 5\right)(f + 1)d + Cd.$$

Note that if $f$ is close to $n$, say $n = f + 1$, then the bound is roughly proportional to $f^2 d$,
which is no improvement on the algorithm of [ADLS90]. However, for these proportional
values of $n$ and $f$, we are better able to bound the running time. The analysis of this section
shows that the running time of our algorithm is also bounded by

$$(2\sqrt{C} + 6)(f + 1)d + Cd.$$

So when $2\sqrt{C} + 6 < 5 + 3\frac{f}{n-f}$, or roughly $n < f + f/\sqrt{C}$, we have a better bound on the
running time.
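
As a rough illustration of when each of the two bounds for this case is the smaller one, the sketch below evaluates both expressions, as stated in Theorems 3.25 and 3.28, and returns the minimum. The function names and sample parameters are assumptions made for the example.

```python
import math


def ratio_bound(n: int, f: int, d: float, C: float) -> float:
    """(3 f/(n-f) + 5)(f+1)d + Cd, the bound that depends on f/(n-f)."""
    return (3 * f / (n - f) + 5) * (f + 1) * d + C * d


def sqrt_bound(f: int, d: float, C: float) -> float:
    """(2 sqrt(C) + 6)(f+1)d + Cd, the bound that depends on sqrt(C)."""
    return (2 * math.sqrt(C) + 6) * (f + 1) * d + C * d


def better_bound(n: int, f: int, d: float, C: float) -> float:
    """Smaller of the two upper bounds for given parameters."""
    return min(ratio_bound(n, f, d, C), sqrt_bound(f, d, C))


if __name__ == "__main__":
    # n = f + 1: the ratio bound grows like f^2, so the sqrt(C) bound wins.
    print(ratio_bound(11, 10, 1.0, 4.0), sqrt_bound(10, 1.0, 4.0))
    # n = 2f: the ratio f/(n-f) is 1, so the ratio bound wins.
    print(ratio_bound(20, 10, 1.0, 4.0), sqrt_bound(10, 1.0, 4.0))
```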

Partition the first $r$ phases of the given execution into two classes according to
their length:

$X_r = \{p : t_p - t_{p-1} < \sqrt{C} \cdot d \text{ and } p \le r\}$ = {short phases}
$Y_r = \{p : p \notin X_r \text{ and } p \le r\}$ = {long phases},

and define

$S_r^f = \{i : i \text{ omits an } r \text{ message to a nonfaulty process after time } t_{r-1}\}$.

We can bound the short phases by their defined bound, but bound the long phases by
chains of $r$ messages, via the following two lemmas.
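
The partition into short and long phases can be sketched as follows. The list representation of the phase end times and the function name are assumptions for illustration; the threshold $\sqrt{C} \cdot d$ is the one used in the definition above.

```python
import math


def partition_phases(t, C, d):
    """Split phases 1..len(t)-1 into short and long ones.

    t[p] is taken to be the end time t_p of phase p, with t[0] = t_0.
    A phase is short if its length is below sqrt(C) * d, long otherwise.
    """
    threshold = math.sqrt(C) * d
    short, long_ = set(), set()
    for p in range(1, len(t)):
        if t[p] - t[p - 1] < threshold:
            short.add(p)
        else:
            long_.add(p)
    return short, long_


if __name__ == "__main__":
    # Hypothetical phase end times with C = 4 and d = 1 (threshold 2).
    print(partition_phases([0.0, 1.0, 5.0, 6.0], C=4.0, d=1.0))
```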

Lemma 3.26 If phase $r \ge 1$ is not quiet then either $t_r \le t_{r-1} + (|S_r^f| + 3)d$ or
all nonfaulty processes decide by this time.

Proof: We once again show that by time $t_{r-1} + (|S_r^f| + 3)d$, all processes receive an $r$ message
and thus by this time are either decided, shut down, blocked in some phase $r' < r$, or may
advance to phase $r + 1$.

By Lemma 3.19, we know that for any process $i_0$, there is a sequence of distinct processes
$i_0, i_1, \ldots, i_k$ such that $i_k$ is the only process in the sequence to broadcast an $r$ message before
time $t_{r-1}$, and for $0 < j \le k$, by time $t_{r-1} + (1 + k - j)d$, $i_j$ sends the first $r$ message, $m_{j-1}$,
received by $i_{j-1}$.

Now, if $k \le |S_r^f|$ then $i_0$ receives $m_0$ by time $t_{r-1} + (k - 1 + 1)d + d \le t_{r-1} + (|S_r^f| + 1)d$. If
$k > |S_r^f|$, then there is a $j$ such that $0 < k - |S_r^f| \le j \le k$ and $i_j \notin S_r^f$. Process $i_j$ therefore
sends an $r$ message to all nonfaulty processes by time $t_{r-1} + (1 + k - j)d \le t_{r-1} + (1 + |S_r^f|)d$.
If some nonfaulty process has not yet decided, then it sends an $r$ message to all other
processes by time $t_{r-1} + (2 + |S_r^f|)d$ and process $i_0$ receives an $r$ message by time
$t_{r-1} + (3 + |S_r^f|)d$. ■

The key observation is that a process cannot fail to a nonfaulty process in many long
phases:

Lemma 3.27 For any execution of the protocol taking at least $\phi$ phases and for any process
$j$, there are at most $\sqrt{C} + 3$ phases $p \in Y_\phi$ such that $j \in S_p^f$.

Proof: If $j$ omits a $k$ message to a nonfaulty process at time $t$ then by Lemma 3.1, that
nonfaulty process detects $j$'s failure by time $t + Cd + d$, broadcasting "shutdown j" at that
time. We have $t \le t_k$, and so $j$ is shut down by time $t + Cd + 2d \le t_k + (\sqrt{C} + 2)\sqrt{C}d$. It
follows that there are at most $\sqrt{C} + 2$ long phases $\ell$ such that $t_k < t_\ell \le t_k + Cd + 2d$. Thus $j$
cannot attempt to send (and cannot omit) an $\ell + 1$ message after time $t_\ell$ and is therefore
not in $S_p^f$ for any $p > \ell$. ■

Theorem 3.28 For $n \le 2f$, the algorithm above solves the consensus problem for $f$ omission
failures within time $(2\sqrt{C} + 6)(f + 1)d + Cd$.

Proof: Let phase $h$ be the first quiet phase. Again, if $h = 0$ then by Lemma 3.10, each
nonfaulty process decides in phase 0 in its first step and the running time is 0. If $h = 1$ then
by Lemma 3.10, each nonfaulty process decides in phase 1 or 0, by time $t_h$, and by Lemma 3.18
the running time is $t_1 - t_0 \le Cd + d$.

If $h > 1$, we consider two cases. Consider first the case that not all nonfaulty processes
decide in phase $h - 2$. We bound the length of the short phases by their defined length.
We bound the length of the long phases by Lemma 3.26 and then sum the sizes of $S_p^f$ using
Lemma 3.27. The length of phase $h$ is bounded by Lemma 3.18. Thus we have

$$
\begin{aligned}
t_h - t_0 &= \sum_{p \in X_{h-1}} (t_p - t_{p-1}) + \sum_{p \in Y_{h-1}} (t_p - t_{p-1}) + (t_h - t_{h-1}) \\
&\le |X_{h-1}| \sqrt{C} d + d \sum_{p \in Y_{h-1}} (3 + |S_p^f|) + (Cd + d) && \text{(by Lemmas 3.26 and 3.18)} \\
&\le |X_{h-1}| \sqrt{C} d + 3|Y_{h-1}| d + f(\sqrt{C} + 3)d + (Cd + d) && \text{(by Lemma 3.27)} \\
&\le (f + 1)\sqrt{C} d + 3(f + 1)d + f(\sqrt{C} + 3)d + Cd + d && \text{(by Lemma 3.13)} \\
&\le (2\sqrt{C} + 6)(f + 1)d + Cd.
\end{aligned}
$$

Now consider the case that all nonfaulty processes decide in phase $h - 2$. The running time is
then bounded by $t_{h-2} - t_0$:

$$
\begin{aligned}
t_{h-2} - t_0 &= \sum_{p \in X_{h-2}} (t_p - t_{p-1}) + \sum_{p \in Y_{h-2}} (t_p - t_{p-1}) \\
&\le |X_{h-2}| \sqrt{C} d + \sum_{p \in Y_{h-2}} (3 + |S_p^f|)d && \text{(by Lemma 3.26)} \\
&\le |X_{h-2}| \sqrt{C} d + 3|Y_{h-2}| d + f(\sqrt{C} + 3)d && \text{(by Lemma 3.27)} \\
&\le f\sqrt{C} d + 3fd + f\sqrt{C} d + 3fd && \text{(by Lemma 3.13)} \\
&= (2\sqrt{C} + 6)fd.
\end{aligned}
$$

Chapter 4

Consensus in the Presence of 
Byzantine Failures 



In this chapter we present a simulation algorithm using $3f + 1$ processes and tolerating $f$
Byzantine failures. The algorithm simulates any synchronous round-based algorithm tolerant
of $f$ Byzantine failures and uses time $r(d + 2Cd) + Cd$, where $r$ is the number of rounds
required by the synchronous algorithm.

The simulation works by keeping processes loosely synchronized to ensure that a nonfaulty
process does not advance to round $r$ until it has received a round $r - 1$ message from every
nonfaulty process. The partial synchronization works by using a combination of two criteria
for advancing to further phases, one based on elapsed local time and the other based on
messages received. A similar technique is used in [WL88] to initiate new rounds of clock
resynchronization. In particular, our criterion for ending round 1 is essentially the same as
the criterion used in [WL88] for ending every round; our criterion for subsequent rounds is
different.



4.1 The simulation algorithm 

The algorithm simulates a synchronous algorithm by ensuring that each nonfaulty process
receives all round $r$ messages of the synchronous algorithm from all other nonfaulty processes
before advancing to round $r + 1$. We do not explore here the formal semantics of "a correct
simulation"; rather we regard as sufficient the following correspondence ensured by the above
property: For every execution of the simulation, there is an execution of the round-based
synchronous algorithm in which the nonfaulty processes receive the same vector of messages
from each other at each round. Since the behavior of faulty processes is not restricted, clearly
the corresponding synchronous execution is legal.

Therefore, for the purposes of simulation, we define a synchronous algorithm by its message
function only, suppressing information about the state of the synchronous algorithm.
Let $M_i(r, V^{r-1})$ denote the vector of messages to be sent in the synchronous algorithm by
process $i$ in round $r$ when the ordered set of messages $V^{r-1}$ is received in round $r - 1$ (of
course, this message function may also depend on the state of the process; we leave this
implicit). Without loss of generality, assume each process sends a message to all processes
at every round of the synchronous algorithm.

Recall that we assume all processes begin executing the algorithm at the same time. At
each step, a process increments a counter $s$ (initially 0) and executes the code in Figure 4.1,
explained below. A local variable, initially 1, keeps track of the ROUND number. Ordered
set $V^r$ contains the $r$th message received from each process. We refer to the $r$th message sent
by a process as a "round r" message. (Recall that we assume each process sends a message
to all processes in every round of the synchronous algorithm.)

Each process first sends its round 1 message and then waits for at least time $d$ to ensure
that it receives a round 1 message from every other nonfaulty process. When it can be sure
that time $d$ has elapsed, it advances to round 2 and broadcasts its round 2 message based
on the round 1 messages it has received so far. It ensures that time $d$ has passed by either
waiting for $d/c_1$ of its own steps or by receiving $f + 1$ round 2 messages; the latter ensures that
some nonfaulty process has waited at least time $d$.

In each subsequent round $r$, a process waits for at least time $2d$ (actually $2d + 3c_2$) after
at least $f + 1$ nonfaulty processes have sent a round $r$ message. By this time, all nonfaulty
processes must have received at least $f + 1$ round $r$ messages and therefore advanced to
round $r$ and sent a round $r$ message. At this time, a process advances to round $r + 1$ and
broadcasts its round $r + 1$ message. Again, there are two ways for a process to deduce
that sufficient time has passed: if it takes $(2d + 3c_2)/c_1$ steps after receiving at least $2f + 1$
round $r$ messages or if it receives at least $f + 1$ round $r + 1$ messages. The latter ensures
that some nonfaulty process has advanced to round $r + 1$ and therefore has already waited
a sufficient amount of time (at least time $2d$ after at least $f + 1$ nonfaulty processes sent a
round $r$ message).



4.2 Correctness 

Let $t_r$ be the latest time that any nonfaulty process sends a round $r$ message. Again, we
assume that all processes begin at the same time (here, $t_1$). We say a process "advances
to round r" when it executes the "goto ROUND r" statement in the code. In order to
prove correctness, we must show that a nonfaulty process eventually advances to all rounds
required by the synchronous algorithm and always receives a round $r$ message from all
nonfaulty processes before advancing to round $r + 1$.

ROUND 1     Send $M(1, -)$; goto ROUND 1'.
ROUND 1'    If $s \ge d/c_1$ or $|V^2| \ge f + 1$,
            then goto ROUND 2.

ROUND r     Send $M(r, V^{r-1})$; goto ROUND r'.
ROUND r'    If $|V^r| \ge 2f + 1$,
            then $s \leftarrow 0$; goto ROUND r''.
ROUND r''   If $s \ge (2d + 3c_2)/c_1$ or $|V^{r+1}| \ge f + 1$,
            then goto ROUND r + 1.

Figure 4.1: The simulation of a synchronous algorithm. At each step, a process
            increments the counter $s$ and executes the code according to its
            present round number. $V^r$ is the ordered set consisting of the
            $r$th message received from each process. $M(r, V^{r-1})$ denotes the
            message function of the synchronous algorithm for round $r$.
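
The per-step logic of Figure 4.1 can be sketched as a small state machine. This is only an illustrative reading of the figure under stated assumptions: the class and method names are hypothetical, one stage of the figure is executed per local step, messages are assumed to carry their round number, and network delivery is left to the caller.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class SimProcess:
    f: int
    d: float
    c1: float
    c2: float
    # message_fn(r, V_prev) plays the role of M(r, V^{r-1}).
    message_fn: Callable[[int, Dict[Any, Any]], Any]
    round_no: int = 1
    stage: str = "send"  # "send", "wait_quorum" (r'), or "wait_timer" (1' / r'')
    s: int = 0           # step counter from the figure
    received: Dict[int, Dict[Any, Any]] = field(default_factory=dict)

    def deliver(self, sender: Any, rnd: int, payload: Any) -> None:
        """Record the first round-`rnd` message seen from `sender`."""
        self.received.setdefault(rnd, {}).setdefault(sender, payload)

    def count(self, rnd: int) -> int:
        return len(self.received.get(rnd, {}))

    def step(self):
        """Take one local step; return (round, payload) to broadcast, or None."""
        self.s += 1
        r = self.round_no
        if self.stage == "send":
            out = (r, self.message_fn(r, dict(self.received.get(r - 1, {}))))
            # Round 1 skips the quorum wait and goes straight to its timer stage.
            self.stage = "wait_timer" if r == 1 else "wait_quorum"
            return out
        if self.stage == "wait_quorum":
            if self.count(r) >= 2 * self.f + 1:
                self.s = 0
                self.stage = "wait_timer"
            return None
        # Timer stage: enough local steps, or f+1 messages from the next round.
        limit = self.d / self.c1 if r == 1 else (2 * self.d + 3 * self.c2) / self.c1
        if self.s >= limit or self.count(r + 1) >= self.f + 1:
            self.round_no = r + 1
            self.stage = "send"
        return None
```

A caller would invoke `step()` once per local step, broadcast any returned message, and call `deliver()` for each incoming message. The two exits from the timer stage correspond to the two ways a process can deduce that sufficient time has passed.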

Lemma 4.1 Each nonfaulty process advances to all rounds required by the synchronous
algorithm.

Proof: By induction on the round number. Clearly each nonfaulty process advances to round 2: it
advances to round 1' after its first step and advances to round 2 after at most $1 + d/c_1$ more
steps.

For $r \ge 2$, assume all nonfaulty processes have advanced to round $r$. Then all nonfaulty
processes have sent a round $r$ message and advanced to round $r'$. Since $n - f \ge 2f + 1$,
there are at least $2f + 1$ nonfaulty processes, so each nonfaulty process eventually receives at
least $2f + 1$ round $r$ messages and advances to round $r''$. After at most $(2d + 3c_2)/c_1$ steps
in round $r''$, each nonfaulty process advances to round $r + 1$. ■

Lemma 4.2 No nonfaulty process advances to round $r + 1$ before receiving a round $r$ message
from each nonfaulty process.

Proof: By induction on the round number $r$.

$r = 1$: Each nonfaulty process takes more than $d/c_1$ steps before advancing to round 2.
Thus each advances later than time $t_1 + d$, which is after the round 1 message of each
nonfaulty process, sent at time $t_1$, is delivered.

$r > 1$: Assume the lemma is true for $r - 1$ (i.e., no nonfaulty process advances to round
$r$ before receiving a round $r - 1$ message from each nonfaulty process). We show the lemma
is true for $r$. Let $i$ be the first correct process to advance to round $r + 1$ and let $\tau_i$ be the
time at which $i$ advances to round $r''$ (by Lemma 4.1, this time is well-defined). We make
the following series of deductions about the events that occur at or before the listed times:

$\tau_i$ : Because $i$ is in round $r''$, by the induction hypothesis, $i$ has received
a round $r - 1$ message from all nonfaulty processes. Because $i$ has
advanced to $r''$, it has received at least $2f + 1$ round $r$ messages.

$\tau_i + d$ : All nonfaulty processes are in round $(r - 1)'$ or greater (because they
have each sent an $r - 1$ message to $i$) and have received at least $2f + 1$
round $r - 1$ messages (from each other).

$\tau_i + d + c_2$ : All nonfaulty processes therefore advance to round $(r - 1)''$.

$\tau_i + d + 2c_2$ : All nonfaulty processes have received at least $f + 1$ round $r$ messages
(from the nonfaulty subset of processes that sent round $r$ messages to $i$)
and therefore advance to round $r$.

$\tau_i + d + 3c_2$ : All nonfaulty processes send a round $r$ message and advance to round $r'$.

$\tau_i + 2d + 3c_2$ : All processes receive a round $r$ message from each nonfaulty process.

Because by choice $i$ is the first nonfaulty process to advance to round $r + 1$, it follows
that $i$ advances to round $r + 1$ only after $(2d + 3c_2)/c_1$ steps in round $r''$, which occurs later
than time $\tau_i + 2d + 3c_2$. We conclude that $i$ receives a round $r$ message from each nonfaulty
process before advancing to round $r + 1$. Since all nonfaulty processes advance to round $r + 1$
after $i$, they also receive a round $r$ message from all nonfaulty processes before advancing. ■



4.3 Analysis of time bounds 

Again, we assume $d \gg c_2$ and therefore approximate $d + c_2$ by $d$ in the timing analysis.






Lemma 4.3 $t_2 - t_1 \le Cd$ and, for $r \ge 2$, $t_{r+1} - t_r \le d + 2Cd$.

Proof: Clearly, $t_2 \le t_1 + Cd$. By time $t_r$ all nonfaulty processes send a round $r$ message
and advance to round $r'$. Therefore by time $t_r + d$, all nonfaulty processes receive at least
$2f + 1$ round $r$ messages and advance to round $r''$. Within another time $2Cd$, all nonfaulty
processes have taken $(2d + 3c_2)/c_1$ steps and advanced to round $r + 1$, sending an $r + 1$
message. ■
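
Summing these per-round bounds over the rounds of a simulated algorithm gives the total simulation time. The following minimal sketch does that sum; the function name is an assumption, and the accounting (one term $Cd$ for the first round, then $d + 2Cd$ for each later round) follows the lemma as stated.

```python
def byzantine_simulation_time(rounds: int, d: float, C: float) -> float:
    """Sum the per-round bounds of Lemma 4.3 for a `rounds`-round algorithm."""
    if rounds < 1:
        raise ValueError("need at least one round")
    first = C * d                           # t_2 - t_1 <= Cd
    later = (rounds - 1) * (d + 2 * C * d)  # t_{r+1} - t_r <= d + 2Cd for r >= 2
    return first + later


if __name__ == "__main__":
    f, d, C = 2, 1.0, 3.0  # hypothetical sample values
    # An (f+1)-round algorithm gives Cd + f(d + 2Cd) = fd + (2f+1)Cd.
    print(byzantine_simulation_time(f + 1, d, C))  # 17.0
```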



Theorem 4.4 There is an algorithm using $3f + 1$ processes which solves the consensus
problem for $f$ Byzantine failures within time $Cd + f(d + 2Cd) = fd + (2f + 1)Cd$.

Proof: Any $(f + 1)$-round synchronous algorithm can be simulated. Agreement and validity
follow from correct simulation. Termination follows from Lemma 4.1. The time bound
follows from Lemma 4.3. ■

Chapter 5

Bounded-capacity Message Links and 
Failure Detection 



In fault-tolerant distributed algorithms, a common primitive for detecting failures is to "time
out" failed processors. If processors fail by simply stopping, then a failure may be detected
by the absence of messages from a processor. In this chapter, we consider how quickly such
failures can be detected in our semi-synchronous model.

If it is assumed that all messages sent are delivered within time $d$ of when they are sent,
then the following simple protocol minimizes the time between any failure and its detection.
(This is the strategy employed in the algorithm of [ADLS90] and our algorithm of Chapter 3.)
Each processor broadcasts a message at every step that it takes. If no message is received
from another processor for more than $(d + c_2)/c_1$ local steps, that processor is declared
faulty. Because local steps are separated by at least time $c_1$, at least time $d + c_2$ passes
before this many steps are taken. Because local steps are separated by at most time $c_2$, the
time between the delivery of any two consecutive messages sent by a processor is at most
$d + c_2$. It follows that only failed processors are declared faulty. The maximum time between
any failure and its detection is approximately $Cd + d$, occurring in the following scenario:
processor $p$ broadcasts a message at time $t$ and then fails; these messages are delivered at
time $t + d$; every other processor runs slowly (its steps separated by $c_2$) after $t + d$, and thus
$p$'s failure is not detected until time $t + d + c_2(d + c_2)/c_1 \approx t + d + Cd$.
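
The timeout rule just described can be sketched as a per-step counter for each peer. The class, method, and function names below are illustrative assumptions; only the threshold $(d + c_2)/c_1$ and the worst-case delay expression come from the text.

```python
class TimeoutDetector:
    """Declare a peer faulty after more than (d + c2)/c1 silent local steps."""

    def __init__(self, peers, d: float, c1: float, c2: float):
        self.limit = (d + c2) / c1
        self.silent_steps = {p: 0 for p in peers}
        self.suspected = set()

    def step(self, heard_from):
        """Take one local step; `heard_from` holds peers heard since the last step."""
        for p in self.silent_steps:
            if p in heard_from:
                self.silent_steps[p] = 0
            else:
                self.silent_steps[p] += 1
                if self.silent_steps[p] > self.limit:
                    self.suspected.add(p)
        return set(self.suspected)


def worst_case_detection_delay(d: float, c1: float, c2: float) -> float:
    """Delay d + c2 (d + c2)/c1 from the slow-observer scenario in the text."""
    return d + c2 * (d + c2) / c1
```

When $d \gg c_2$ the second function is close to $d + Cd$, since $c_2 d / c_1 = Cd$.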

Although the above protocol guarantees minimal delay between any failure and its detection, 
it is clearly inefficient in its use of messages. It takes advantage of the strong assumption 
that all messages are delivered within time d, regardless of the rate at which they are sent. 
In reality, the performance of a message link may suffer if messages are sent too frequently. 
In this chapter, we propose a model of message links with bounded capacity and analyze the 
effect of the capacity bound on the efficiency of detecting stopping failures. 
5.1 Modeling bounded-capacity links 

Define a message link of unit capacity and delay d as a communication channel that queues 
incoming messages in FIFO order and delivers the first message in the queue within time 
d of the later of when the message is sent and when the previous message is delivered. 
(For simplicity, we will assume that message links deliver messages in the order sent. Our 
algorithms do not make use of this assumption and our lower bounds hold in spite of it.) For 
a positive integer μ, define a message link of capacity μ and delay d as the composition of μ 
message links each of unit capacity and delay d/μ, connected serially so that messages are 
delivered from link i to link i + 1 for 1 ≤ i < μ and link μ delivers messages to the recipient 
process. Messages are neither lost by a link nor delivered out of order, and once a processor 
has sent a message, it cannot cancel that message. 

Thus, in the absence of any other message traffic, the delay of a single message is bounded 
by μ · d/μ = d. Note that if a single component link delays all messages by the maximum 
amount, d/μ, then messages are delivered at a maximum rate of μ messages per time d. In 
particular, it is easy to see that if the last component link delays each message by d/μ, then 
for any interval of time of length l, at most l/(d/μ) + 1 messages are delivered. 

On the other hand, if no two messages are sent within time d/μ of each other, then each 
message is delivered within time d of when it is sent. This is easily seen by an induction on 
the number of messages sent. Assume the previous message is delivered by the i-th sublink 
within time i · d/μ of when it is sent (clearly this is true for the "first" message ever sent). 
If message m is sent at time t, then for 0 < i ≤ μ, by time t + i · d/μ the previous message 
has been delivered by link i + 1 (when such a link exists) and m is delivered by link i (by 
induction on i). Thus m is delivered to the recipient process within time μ · d/μ = d of when 
it is sent. 
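
The link model can be sketched as a small simulation. The Python code below is a minimal sketch under the assumption of an adversarial link in which every component sublink always takes its full delay d/μ; the function names and sample values are hypothetical. It shows that messages sent d/μ apart each arrive within time d, while messages sent faster accumulate growing delay.

```python
from fractions import Fraction


def worst_case_delivery_times(send_times, d, mu):
    """Delivery times over mu serial unit-capacity sublinks of delay d/mu,
    assuming each sublink always uses its maximum delay."""
    hop = Fraction(d, mu)
    times = [Fraction(t) for t in send_times]
    for _ in range(mu):
        out, prev_out = [], None
        for arrival in times:
            # A sublink starts on a message at the later of its arrival
            # and the delivery of the previous message.
            start = arrival if prev_out is None else max(arrival, prev_out)
            prev_out = start + hop
            out.append(prev_out)
        times = out
    return times


def delays(send_times, d, mu):
    """End-to-end delay of each message."""
    delivered = worst_case_delivery_times(send_times, d, mu)
    return [t - s for t, s in zip(delivered, send_times)]


if __name__ == "__main__":
    d, mu = 12, 3                       # hypothetical values, so d/mu = 4
    print(delays([0, 4, 8, 12], d, mu))  # sends spaced d/mu apart
    print(delays([0, 1, 2, 3], d, mu))   # sends spaced closer than d/mu
```

In the second call each successive message waits longer than the one before it, which is the queue build-up that the lower-bound argument of Section 5.2.2 exploits.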

For the lower bound, we assume only that in the worst case, a link delivers every pair of 
messages at least time d/μ apart. 



5.2 Timing out failed processors 

We will consider a system of processors fully connected by message links of capacity μ and 
delay d. These processors may fail by stopping. A process is said to detect the failure of 
another processor when it irrevocably decides that the other has failed. A timeout protocol 
is correct if it satisfies two properties for all executions and all processors p and q: (1) if p 
fails and q does not fail, then q eventually detects the failure of p, and (2) if neither p nor q 
fails, then neither p nor q detects the failure of the other. 
For a given execution α, we say that p detects the failure of q within time T in α if q 
fails at time t in α and p detects the failure of q at time t' < t + T in α. We say a timeout 
protocol guarantees a detection time of T if for all processors p and q and all executions α 
in which p fails but q does not, q detects the failure of p within time T in α. 

Because in our model each pair of processors is connected by a private bidirectional 
message link, we will assume that the timeout protocol executes independently for each 
pair of processors. We will therefore prove bounds on detection time for a system of two 
processors, p and q. 

5.2.1 Upper bounds on detection time 

An upper bound of 2Cd + d is achieved by a simple protocol that works for any link capacity. 
The two processors continually exchange a single token message: when p receives the token 
message from q, it sends a token message back to q, and q does likewise. If a processor 
takes more than (2d + c_2)/c_1 steps without receiving a message, it concludes that the other 
processor is faulty. Because there is at most one message in transit at any time, it is always 
delivered within time d of when it is sent. Clearly a nonfaulty processor is never timed out. 
This protocol guarantees that any failure is detected within time 2Cd + d (to be precise, 
d + C(2d + c_2) + c_2; recall we approximate d + c_2 ≈ d): if p fails at time t, then by time t + d 
all of the messages it has sent are delivered to q and q has sent its last message to p; within 
another time c_2(1 + (2d + c_2)/c_1) ≈ 2Cd, q has taken enough steps to conclude that p has 
failed. 

An upper bound of C^2 d/μ + Cd + d is achieved by a protocol in which each processor 
sends a message every (d/μ)/c_1 steps. A process concludes that the other has failed if it 
has taken more than (Cd/μ + d)/c_1 steps without receiving a message. Clearly, the sending 
times of every two messages are separated by at least time d/μ and therefore, as shown 
in Section 5.1, each message is delivered within time d of when it is sent. The maximum 
amount of time between the delivery of two consecutive messages from a given processor is 
then c_2(d/μ)/c_1 + d = Cd/μ + d (if the first message is delivered immediately and the following 
message incurs the maximum possible delay, d). This is less than the minimum amount of 
time, Cd/μ + d + c_1, that the other processor waits before detecting failure. This protocol 
guarantees a detection time of C^2 d/μ + Cd + d: if p fails at time t, then by time t + d all of the 
messages it has sent are delivered to q; within another time c_2(Cd/μ + d)/c_1 = C^2 d/μ + Cd, 
q has taken enough steps to conclude that p has failed. 
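
The parameters of this second protocol can be checked numerically. The following Python sketch is illustrative only: the function name, the dictionary keys, and the sample values are hypothetical, and it assumes integer inputs for which (d/μ)/c_1 is a whole number of steps.

```python
from fractions import Fraction
from math import floor


def periodic_timeout_parameters(d, mu, c1, c2):
    """Derived quantities for the protocol that sends every (d/mu)/c1 steps."""
    hop = Fraction(d, mu)
    between_sends = hop / c1
    if between_sends.denominator != 1:
        raise ValueError("sketch assumes (d/mu)/c1 is a whole number of steps")
    between_sends = int(between_sends)
    C = Fraction(c2, c1)
    # "More than (Cd/mu + d)/c1 steps": smallest whole count above the threshold.
    timeout_steps = floor((C * hop + d) / c1) + 1
    return {
        "steps_between_sends": between_sends,
        "timeout_steps": timeout_steps,
        # Slow sender, first message immediate, next message delayed by d.
        "max_delivery_gap": c2 * between_sends + d,
        # Fast observer: least real time before it can declare a failure.
        "min_wait": c1 * timeout_steps,
        # Last message takes d, then a slow observer counts out the timeout.
        "worst_case_detection": d + c2 * timeout_steps,
    }


if __name__ == "__main__":
    params = periodic_timeout_parameters(d=12, mu=3, c1=1, c2=2)
    for name, value in params.items():
        print(name, value)
```

For these sample values the largest gap between deliveries (20) stays below the shortest possible wait (21), so a live processor is not timed out, and the worst-case detection time (54) exceeds C^2 d/μ + Cd + d = 52 only by the single extra step c_2 that the text's approximation drops.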

Thus we obtain a simple upper bound of min(2Cd + d, C^2 d/μ + Cd + d). Note that 
2Cd + d < C^2 d/μ + Cd + d if and only if μ < C. 
5.2.2 Lower bounds on detection time 

We now prove a nearly corresponding lower bound of min(2Cd + d/μ, C^2 d/μ + Cd + d). 
Note that 2Cd + d/μ < C^2 d/μ + Cd + d if and only if μ < C + 1. Thus, the bounds are 
tight except for μ < C + 1: when C ≤ μ < C + 1, C^2 d/μ + Cd + d is the best upper bound 
and 2Cd + d/μ is the best lower bound; when μ < C, 2Cd + d is the best upper bound and 
2Cd + d/μ is the best lower bound. 
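
The relationship between the upper and lower bounds can be tabulated directly. The Python sketch below simply evaluates the two expressions stated in the text with exact rational arithmetic; the function names and sample values are hypothetical.

```python
from fractions import Fraction


def upper_bound(C, d, mu):
    """min(2Cd + d, C^2 d/mu + Cd + d), the upper bound of Section 5.2.1."""
    C, d = Fraction(C), Fraction(d)
    return min(2 * C * d + d, C * C * d / mu + C * d + d)


def lower_bound(C, d, mu):
    """min(2Cd + d/mu, C^2 d/mu + Cd + d), the lower bound of Section 5.2.2."""
    C, d = Fraction(C), Fraction(d)
    return min(2 * C * d + d / mu, C * C * d / mu + C * d + d)


if __name__ == "__main__":
    C, d = Fraction(5, 2), 1            # hypothetical values, so C + 1 = 7/2
    for mu in (1, 2, 3, 4, 8):
        up, low = upper_bound(C, d, mu), lower_bound(C, d, mu)
        print(mu, up, low, "tight" if up == low else "gap")
```

With these values the bounds coincide for μ = 1 (where d/μ = d) and for μ ≥ 4 > C + 1, and differ for μ = 2 and μ = 3.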

We first prove that there exists some execution in which p runs "fast" (its steps separated 
by time c_1), q runs "slowly" (its steps separated by time c_2), messages from q to p are delivered 
immediately, messages from p to q are delayed by at least time d, and some pair of consecutive 
messages from p to q are separated by at least time d/μ. We prove that such an execution 
exists for any protocol guaranteed to detect failures within any bounded amount of time. 
This is proved below using the properties of the bounded-capacity message links. The idea 
is that if the last component link from p to q delays all messages by d/μ then the delivery of 
every pair of messages is separated by time d/μ. Therefore, if each pair of messages sent by 
p were separated by less than d/μ, then messages would be sent (put onto the link) faster 
than they were delivered (removed from the link). Thus the number of messages sent but 
undelivered and, consequently, the total delay of a message, would grow in time without 
bound. The time between when p fails and when q receives no further messages is therefore 
increased without bound. 

Lemma 5.1 For all B and for any correct timeout protocol that guarantees a detection time 
of B, there exists an execution in which 

1. All consecutive steps of p are separated by c_1; 

2. All consecutive steps of q are separated by c_2; 

3. All messages from q to p are delayed by time 0; 

4. All messages from p to q are delayed by at least time d; 

5. For all t_0, there exists a pair of messages m_1 and m_2 sent by p at times t_1 and t_2 
   respectively, such that t_1 > t_0, t_2 − t_1 ≥ d/μ, and no message is sent by p in the 
   interval (t_1, t_2). 

Proof: For contradiction, suppose not. Fix any execution β of the protocol in which (i) the 
first three timing constraints are satisfied, (ii) each component link from p to q delays each 
message by time d/μ, and (iii) no processor fails. Such an execution exists because conditions 
(i), (ii) and (iii) are independent of each other and within the bounds of the model. Clearly, 
condition (ii) implies that the fourth timing constraint is satisfied: all messages from p to 
q are delayed at least time d. We prove that the fifth condition is also satisfied in β. To do 
so, assume for contradiction that it is not. 
First note that because β is infinite, p must send an infinite number of messages: if it 
does not, then let m_i be the last message that it sends and consider an execution γ in which 
p fails after sending m_i. Because q receives the same messages from p in each execution, 
it cannot distinguish between the two executions and therefore q either does not detect p's 
failure in γ or erroneously decides that p has failed in β. 

Recall that a processor can send messages only during steps and p's steps are separated 
by exactly time c_1 in β. It follows that if two consecutive messages are not separated by 
at least time d/μ, then they are separated by at most k = ⌈(d/μ)/c_1⌉ − 1 steps, which is time 
k · c_1 < d/μ. 

Consider the interval [t_0, t_0 + x] of execution β, where x is defined below. Because p 
sends an infinite number of messages and, by assumption, every two consecutive messages 
are separated by at most time k · c_1, process p sends at least ⌊x/(k · c_1)⌋ messages in this 
interval. But since the last component link delays each message by d/μ, at most x/(d/μ) + 1 
messages are delivered in this interval. Thus the number of messages sent but not delivered 
in this interval is at least 

\[ \Bigl(\frac{x}{k \cdot c_1} - 1\Bigr) - \Bigl(\frac{x}{d/\mu} + 1\Bigr). \]

According to the properties of the message links, the last message sent in this interval may 
not be delivered until all prior messages have been delivered. Thus the last message sent by 
p in this interval may not be delivered until time 

\[ t_0 + x + \frac{d}{\mu}\Bigl(\frac{x}{k \cdot c_1} - \frac{x}{d/\mu} - 2\Bigr). \]

Let x be any number large enough so that 

\[ \frac{d}{\mu}\Bigl(\frac{x}{k \cdot c_1} - \frac{x}{d/\mu} - 2\Bigr) > B \]

(recall that k · c_1 < d/μ). 

We conclude that the last message sent by p in the interval [t_0, t_0 + x] of β is not delivered 
until after time t_0 + x + B. Since p does not fail in β, q does not time out p; in particular, q 
does not time out p before time t_0 + x + B. However, before time t_0 + x + B, this execution is 
indistinguishable to q from an execution in which p fails at time t_0 + x and which is otherwise 
identical to β at p and q up to times t_0 + x and t_0 + x + B, respectively. Therefore in this 
execution q does not detect the failure of p within time B. This is a contradiction on the 
assumed protocol. ■ 
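
To make the counting argument concrete, the following Python sketch computes how long the interval must be before the backlog on the link forces a delay exceeding B. It is a sketch under stated assumptions: k is taken to be the largest whole number of steps with k · c_1 < d/μ, the number of messages sent in an interval of length x is bounded below by x/(k · c_1) − 1, the number delivered is bounded above by x/(d/μ) + 1, and d/μ > c_1 so that k ≥ 1. All function names and sample values are hypothetical.

```python
from fractions import Fraction
from math import ceil


def max_step_gap(d, mu, c1):
    """Largest whole k with k * c1 < d/mu (requires d/mu > c1)."""
    k = ceil(Fraction(d, mu) / c1) - 1
    if k < 1:
        raise ValueError("sketch assumes d/mu > c1")
    return k


def backlog_delay(x, d, mu, c1):
    """Lower bound on the extra delay of the last message sent in an
    interval of length x, using the counting bounds assumed above."""
    hop = Fraction(d, mu)
    k = max_step_gap(d, mu, c1)
    return hop * (Fraction(x) / (k * c1) - Fraction(x) / hop - 2)


def interval_threshold(B, d, mu, c1):
    """Value x* such that backlog_delay(x) > B exactly when x > x*."""
    hop = Fraction(d, mu)
    k = max_step_gap(d, mu, c1)
    return (B + 2 * hop) / (hop / (k * c1) - 1)


if __name__ == "__main__":
    d, mu, c1, B = 12, 3, 1, 100        # hypothetical values
    x_star = interval_threshold(B, d, mu, c1)
    print(max_step_gap(d, mu, c1), x_star)
    print(backlog_delay(x_star, d, mu, c1), backlog_delay(x_star + 1, d, mu, c1))
```

Because k · c_1 < d/μ, the backlog delay grows linearly in x, so a suitable x exists for every finite B.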

Our lower bound proof uses the retiming techniques of "shifting" events in time and 
"shrinking" portions of executions that were developed in [AL89] and [LL84]. 

Theorem 5.2 In a system with links of capacity μ and delay d, no correct timeout protocol 
can guarantee failures to be detected within less than time min(2Cd + d/μ, C^2 d/μ + Cd + d). 

Proof: Let T = min(2Cd + d/μ, C^2 d/μ + Cd + d). For contradiction, assume the existence 
of a protocol that guarantees a detection time of T. We do not make use of the particular 
value of T until the final step of the proof (the construction of execution β''). We will reach 
a contradiction by showing that there is an execution of the protocol in which p does not 
fail but q decides that it has. 

Let β be an execution of the protocol whose existence is implied by Lemma 5.1 with 
t_0 = (C/(C − 1))d. Let m_1 and m_2 be the two messages specified by the lemma, sent by p 
at times t_1 and t_2 respectively. Figure 5.1 depicts an example of an execution satisfying 
Lemma 5.1; for presentation, messages from p to q are shown taking exactly time d and 
messages from q to p are shown at arbitrary times. 

[Figure 5.1 diagram: timeline of p (fast) on the left and q (slow) on the right.] 

Figure 5.1: Execution β, the existence of which is proved by Lemma 5.1, takes 
the above form except that messages from p to q may be delayed 
more than d and messages may be sent by q at arbitrary times. The 
events of p (q) are on the left (right), with time represented by the 
vertical dimension. An arrow represents a message labelled with its 
delay, with its tail at the time of the send event and its tip at the 
time of the receive event. 

Let α be an execution in which (i) events at p are identical to those of β up to time t_1, 
(ii) p fails at time t_1 after sending m_1, and (iii) events at q are identical to those of β up to 
time t_1 + d/μ + d. Clearly α exists, since messages from p to q are delayed by at least time 
d in β and so q doesn't receive m_2 until t_2 + d ≥ t_1 + d/μ + d. Also, the assumed protocol 
guarantees that in α, q detects the failure of p before time t_1 + T. 

The rest of the proof proceeds as follows. By retiming the events of α and β, we construct 
executions α' and β', which are indistinguishable to both p and q from α and β respectively. 
By retiming the events of α', we construct α'', which is indistinguishable from α' to q. Finally, 
by retiming the events of β', we construct β'', which is indistinguishable from β' to p up to 
the time that it sends m_2 and indistinguishable from α'' to q up to the time that it times 
out p in α''. Thus, although p does not fail in β'', q times out p in β'', contradicting the 
correctness of the assumed protocol. 
[Figure 5.2 diagram: timelines of p (fast) and q (slow), with time t_1 + d/μ marked and a 
circle at time t_1 + T − d.] 

Figure 5.2: In the region of interest, execution α' is simply α with events of 
processor q occurring earlier in time by d. Because p fails at time 
t_1, q detects the failure of p by time t_1 + T − d, denoted by the 
circle. 



Construction of executions α' and β' 

Conceptually, we wish to construct α' from α by letting each event at q occur earlier in time 
by d ("shifting" those events earlier by d). Strictly speaking, this may not be possible for 
all events at q because of initial conditions. However, it is sufficient to shift by d the events 
of q that occur after time t_1 in α and "shrink" some interval before t_1 in α (i.e., retiming 
the events of the interval so that q runs fast in that interval of events in α'). In particular, 
we shrink the interval [0, (C/(C − 1))d]. Note that by our choice of t_0 = (C/(C − 1))d in choosing β, 
the last event of this interval occurs before m_1 is sent at time t_1. Leaving all events at time 0 
unchanged, steps of q in this interval, which take time c_2 in α, are retimed to take time c_1 
in α'. Thus the interval is shrunk by a factor of C and the last event of the interval occurs 
earlier in α' by (C/(C − 1))d − (1/(C − 1))d = d. Figure 5.2 depicts the suffix of α', showing the 
shifted events of the region in which we shall be interested. 

This execution satisfies the timing constraints on message delivery, since messages sent 
by p (delayed by at least d in α) are received by q at most d earlier in α' and hence are 
delayed by at least 0 in α'; messages sent by q (delayed by 0 in α) are sent at most d earlier 
in α' and hence are delayed by at most d in α'. 

Execution β' is constructed similarly, shifting earlier by d the events at q in β. 
Because p and q do not know the time between any particular pair of steps they take, 
they cannot distinguish between either α and α' or β and β'. It follows that α' and β' are 
not distinguishable to p up to the point at which it fails and not distinguishable to q up 
to when it receives m_2 in β' (at least time t_2 ≥ t_1 + d/μ). Also, q's detection of p's failure 
occurs before time t_1 + T − d in α'. 

[Figure 5.3 diagram: timelines of p and q with regions marked (slow) and (fast); labelled 
times include t_0 − t_0/C, t_1 + d/μ, t_1 − (1/C)(d − d/μ), and t_1 + (1/C)(T − d).] 

Figure 5.3: Execution α'' is constructed from α' by mapping the interval 
[t_1 − (d − d/μ), t_1 + (T − d)] of α' to the interval 
[t_1 − (1/C)(d − d/μ), t_1 + (1/C)(T − d)] of α'' and appropriately shifting 
the rest of q's events. 



Construction of execution α'' 

Recall that q runs slowly in α and most of α': its steps are separated by c_2. We now 
construct α'' from α' by retiming certain events at q. Events at p are the same as in α' up 
to time t_1, when p fails in both executions; after time t_1, the events (message deliveries) at 
p may be defined arbitrarily within the bounds of the model. 

The retiming operation at q maps the interval [t_1 − (d − d/μ), t_1 + (T − d)] of α' to 
the interval [t_1 − (1/C)(d − d/μ), t_1 + (1/C)(T − d)] of α'' by letting q run fast over this interval 
in α''. Events at time t_1 in α' also occur at t_1 in α''; events in the above interval of α' are 
retimed to occur closer to time t_1 by a factor of C. The rest of execution α' (before time 
t_1 − (d − d/μ) and after time t_1 + (T − d)) is shifted merely to preserve the step times of 
events on the borders of this interval. To be precise, α'' is defined at q by retiming each 
event that occurs at q at time t' > (1/(C − 1))d in α' to occur at q at time t'' in α'', where t'' is 
defined as follows: 

\[
t'' = \begin{cases}
t' + (d - d/\mu) - \frac{1}{C}(d - d/\mu) & \text{if } \frac{1}{C-1}d < t' < t_1 - (d - d/\mu) \\
t_1 + \frac{1}{C}(t' - t_1) & \text{if } t_1 - (d - d/\mu) \le t' \le t_1 + (T - d) \\
t' - (T - d) + \frac{1}{C}(T - d) & \text{if } t' > t_1 + (T - d)
\end{cases}
\]

This execution is illustrated in Figure 5.3. 
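
The retiming of q's events can be written as a small function, which makes it easy to check that the three pieces agree at their boundaries. The Python sketch below is an illustration under assumptions: it reads the three cases as a shift, a compression toward t_1 by a factor of C, and another shift; it does not model the initial segment of the execution that is handled separately; and the function name and sample values are hypothetical.

```python
from fractions import Fraction


def retime_q_event(t_prime, t1, T, d, mu, C):
    """Map the time t' of an event at q in alpha' to its time t'' in alpha''."""
    t_prime, t1, T, d, C = (Fraction(v) for v in (t_prime, t1, T, d, C))
    before = d - d / mu          # length d - d/mu of the part before t1
    after = T - d                # length T - d of the part after t1
    if t_prime < t1 - before:
        # Shifted later by (d - d/mu) - (1/C)(d - d/mu).
        return t_prime + before - before / C
    if t_prime <= t1 + after:
        # Compressed toward t1 by a factor of C (q runs fast here).
        return t1 + (t_prime - t1) / C
    # Shifted earlier by (T - d) - (1/C)(T - d).
    return t_prime - after + after / C


if __name__ == "__main__":
    C, d, mu, t1 = 2, 12, 3, 100                 # hypothetical values
    T = min(2 * C * d + Fraction(d, mu), Fraction(C * C * d, mu) + C * d + d)
    for t_prime in (80, 92, 100, 140, 150):
        print(t_prime, retime_q_event(t_prime, t1, T, d, mu, C))
```

With these values T = 52, the interval [92, 140] of α' is mapped onto [96, 120], and events outside it keep their spacing.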

Again, we need to shift the events before time t_1 − (d − d/μ) while preserving 
initial conditions. To do this we partially undo the shrinking performed on the interval 
[0, (C/(C − 1))d] of α. These events were mapped to the interval [0, (1/(C − 1))d] of α', in which q 
runs fast, with the last event of the interval occurring exactly time d earlier in α' than in α. 
In α'', we need the last event of this interval to occur exactly time (d − d/μ) − (1/C)(d − d/μ) 
later than in α'. Because this amount is less than d, we are able to do this, in effect 
partially undoing the original shrinking. The timing assumptions for steps of q are 
clearly satisfied. Because the net effect from both shrinking operations is to shift any 
particular event in the interval [0, (C/(C − 1))d] of α earlier by less than d in α'', the timing 
assumptions for message delivery are also clearly satisfied, for the reasons outlined in 
the discussion of α'. 

By construction, this retiming operation does not cause violations of the bounds on 
process step times. We now verify that α'' is consistent with the timing assumptions for 
message delivery. First note that all events at p before time t_1 occur at the same time in 
executions β, α, α' and α''. We show that for any event at q occurring at time t'' in α'' and 
at time t in α (and hence at t' = t − d in α') such that t'' ≤ t_1 and t > (C/(C − 1))d, we have 
t − d ≤ t'' ≤ t. By the retiming mapping above, as a function of t' we have 

\[ t' \le t'' \le t' + (d - d/\mu) - \frac{1}{C}(d - d/\mu), \]

(this is because t' is mapped forward in time furthest when t' ≤ t_1 − (d − d/μ); least when 
t' = t_1) which, substituting t' = t − d, gives 

\[ t - d \le t'' \le t - d/\mu - \frac{1}{C}(d - d/\mu) \le t. \tag{5.1} \]

In α, every message from p to q is delayed by at least d. We claim that in α'', every 
message from p to q is delayed by at least time 0 and by less time than in α. If a message is 
delivered at q after time t_1 in α'', then because p sends no messages after time t_1, it must be 
sent by t_1 (no new message receipts at q have been introduced to α'') and hence delayed at 
least time 0; also, events at q after time t_1 in α'' occur earlier in α'' than in α, so the message 
is delayed by less than it is in α. If a message is delivered at q at time t'' ≤ t_1 in α'' then by 
Equation 5.1, it is delivered earlier in α'' than in α by not more than d; it follows that the 
message is delayed by at least time 0 in α''. 
We also claim that each message from q to p in α'' is delayed by at least 0 
and at most d. In α, all messages from q to p are delayed 0; if in α'' they are sent before t_1, 
then from Equation 5.1 they are sent earlier (and delayed more) by not more than d. Any 
message sent by q after t_1 is defined arbitrarily to be within the bounds of the model. 

Finally, we note that q detects the failure of p before time t_1 + (1/C)(T − d) in α''. 



Construction of execution β'' 

We now construct execution β'' in which p does not fail and which is indistinguishable to q 
from α'' up to time t_1 + (1/C)(T − d). In proving that β'' satisfies the timing assumptions on step 
time and message delivery, we will make use of the fact that T = min(2Cd + d/μ, C^2 d/μ + 
Cd + d). Because q times out p before time t_1 + (1/C)(T − d) in α'', we conclude that in β'', q 
mistakenly times out the nonfaulty p, contradicting the assumed correctness and completing 
the proof. 

To construct β'' at q we use exactly the same events as in α'', up to time t_1 + (1/C)(T − d). 
We do not specify the events occurring at q later than this except to say that any message 
sent by p after time t_1 is received at q at least time d later. 



[Figure 5.4 diagram: timelines of p and q with regions marked (fast), (not fast) and (slow); 
labelled times include t_0 − t_0/C, t_1 + (1/C)(T − d) − d, and t_1 − (1/C)(d − d/μ).] 

Figure 5.4: Execution β'' is essentially the same as execution α'', except that p 
does not fail; instead, it runs slowly after sending message m_1, and 
message m_2 is delayed by d. Because p sends no other messages 
before m_2, this execution appears the same as α'' to q until it 
receives m_2. 
At p, we construct β'' from β' by mapping the interval [t_1, t_1 + d/μ] of β' to the interval 
[t_1, t_1 + (1/C)(T − d) − d] of β'' (p runs fast over this interval in β'; it runs more slowly over 
this interval in β''). Events in this interval are retimed to occur further from time t_1 by at 
most a factor of C (as we will show). We do not specify events occurring at p after time 
t_1 + (1/C)(T − d) − d except to say that any message sent by q after time t_1 − (1/C)(d − d/μ) is 
delivered at p exactly time d later. Thus, β'' is defined at p for t'' ≤ t_1 + (1/C)(T − d) − d by 
retiming each event that occurs at time t' in β' to occur at time t'' in β'', where t'' is 
defined as follows: 

\[
t'' = \begin{cases}
t' & \text{if } t' \le t_1 \\
t_1 + \frac{\frac{1}{C}(T - d) - d}{d/\mu}\,(t' - t_1) & \text{if } t_1 < t' \le t_1 + d/\mu
\end{cases}
\]

This execution is illustrated in Figure 5.4. 

We now verify that β'' is consistent with the timing assumptions of the model. Note that 
all events of β'' at p before time t_1 are the same in β', β, α, α', and α''; events of β'' at q before 
time t_1 + (1/C)(T − d) are by definition the same as in α''. Having already verified the timing 
properties for α'', we need verify only the timing properties involving events (processor steps, 
message sends, and message receipts) occurring at p in the interval [t_1, t_1 + (1/C)(T − d) − d]. 
Events occurring at p later than t_1 + (1/C)(T − d) − d and at q later than t_1 + (1/C)(T − d) are 
inconsequential to the proof and may be scheduled in any way consistent with the bounds 
of the model. 

First, we verify that successive steps of p after t_1 are separated by at most c_2. We show 
that for any interval [t_i'', t_j''] of β'', mapped from the interval [t_i, t_j] of β, where 
t_1 ≤ t_i < t_j ≤ t_1 + d/μ, we have t_j'' − t_i'' ≤ C(t_j − t_i): 

\begin{align*}
t_j'' - t_i'' &= (t_j - t_i)\,\frac{\frac{1}{C}(T - d) - d}{d/\mu} \\
 &\le (t_j - t_i)\,\frac{\frac{1}{C}(C^2 d/\mu + Cd) - d}{d/\mu} \\
 &= (t_j - t_i)\,C.
\end{align*}

It follows that because any two steps of p are separated by time c_1 in β', they are separated 
by at most C · c_1 = c_2 in β''. 

We now verify that messages sent by p after t_1 are within the proper bounds. The first 
message sent by p after t_1 is m_2, which in β' (and β) is sent at t_2 ≥ t_1 + d/μ and thus in β'' 
is sent at t'' ≥ t_1 + (1/C)(T − d) − d. Messages sent by p after time t_1 are specified to be delayed 
by at least time d, so m_2 is not delivered until at least time t_1 + (1/C)(T − d). (Note that q 
times out p by this time.) The delivery of m_2 and all subsequent messages by p is consistent 
with our definition of β'' at q. 

We now verify that messages from q to p are within the proper bounds. We analyze these 
messages in three cases according to when they are sent by q in execution β' (which is the 
same time as they are sent in α'). 

Case 1: q sends at time t' < t_1 − d in β'. 
A message sent by q at time t_1 − d in β' is sent and delivered at time t_1 in β. Since the events 
at p are the same in β'' and β and α'', messages sent by q before t_1 − d in β'' are delivered 
to p by t_1 in β'' and are therefore, by the analysis of α'', delayed by correct amounts. 

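Case 2: q sends at time t' in β' where t_1 − d ≤ t' < t_1 − (d − d/μ). 
Such a message is delivered at time t' + d in β' where t_1 ≤ t' + d < t_1 + d/μ. In β'' the 
sending event at q is mapped to time t' + (d − d/μ) − (1/C)(d − d/μ), which is less than t_1. In 
β'' the delivery event at p is mapped to t_1 + (μ/d)((1/C)(T − d) − d)(t' + d − t_1), which is not 
less than t_1. Thus, this message is delayed by at least 0. We show below that it is delayed by 
at most d: 

\begin{align*}
&\Bigl[t_1 + \frac{\frac{1}{C}(T - d) - d}{d/\mu}\,(t' + d - t_1)\Bigr]
 - \Bigl[t' + (d - d/\mu) - \frac{1}{C}(d - d/\mu)\Bigr] \\
&\quad= \Bigl(\frac{\frac{1}{C}(T - d) - d}{d/\mu} - 1\Bigr)t' + t_1
 + \frac{\frac{1}{C}(T - d) - d}{d/\mu}\,(d - t_1)
 - (d - d/\mu) + \frac{1}{C}(d - d/\mu) \\
&\quad\le \Bigl[t_1 + \frac{\frac{1}{C}(T - d) - d}{d/\mu}\,\bigl(t_1 - (d - d/\mu) + d - t_1\bigr)\Bigr]
 - \Bigl[t_1 - (d - d/\mu) + (d - d/\mu) - \frac{1}{C}(d - d/\mu)\Bigr] \\
&\qquad\text{since } t' < t_1 - (d - d/\mu) \text{ and } \frac{\frac{1}{C}(T - d) - d}{d/\mu} - 1 \ge 0 \\
&\quad= \Bigl[t_1 + \frac{1}{C}(T - d) - d\Bigr] - \Bigl[t_1 - \frac{1}{C}(d - d/\mu)\Bigr] \\
&\quad\le \frac{1}{C}(2Cd + d/\mu - d) - d + \frac{1}{C}(d - d/\mu)
 \qquad\text{since } T \le 2Cd + d/\mu \\
&\quad= d.
\end{align*}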



Case 3: q sends at time t' ≥ t_1 − (d − d/μ) in β'. 
These messages are sent at t'' ≥ t_1 − (1/C)(d − d/μ) in β'' and thus are defined to be delivered 
at p exactly time d later. Note that such messages are delivered at p no earlier than time 

\begin{align*}
t_1 - \frac{1}{C}d + \frac{1}{C}d/\mu + d
 &= t_1 + 2d + \frac{1}{C}d/\mu - \frac{1}{C}d - d \\
 &= t_1 + \frac{1}{C}(2Cd + d/\mu - d) - d \\
 &\ge t_1 + \frac{1}{C}(T - d) - d.
\end{align*}

This is consistent with our definition of β'' at p. 

Thus we conclude that f3" is a valid timed execution in which p does not fail but q times 
out p. This is a contradiction on the correctness of the assumed protocol. ■ 
5.2.3 Bounds for two processors using a single message link 

We remark that our techniques give a tight upper and lower bound of C^2 d/μ + Cd + d for a 
system of two processors with a message link in only one direction. 

In such a system, we have two processors, p and q, and a single message link of capacity 
μ from p to q. Naturally, a protocol does not need to detect failures of q. All other previous 
definitions apply. 

The second simple protocol described in Section 5.2.1 operates independently in each 
direction. It immediately gives a protocol for the unidirectional case, guaranteeing that in 
any execution, q detects the failure of p within time C^2 d/μ + Cd + d. 

It is also not difficult to see that our lower bound proof of Theorem 5.2 specializes to the 
unidirectional case to give a corresponding lower bound of C^2 d/μ + Cd + d. Theorem 5.2 is 
proved for T = min(2Cd + d/μ, C^2 d/μ + Cd + d). A similar theorem for the unidirectional 
case may be proved with T = C^2 d/μ + Cd + d. Recall that in that proof, the value of the 
timeout detection time T guaranteed by the protocol is not used before the claims about 
execution β''. All preceding claims except those involving messages from q to p carry over a 
fortiori. Lemma 5.1, for example, is true also for the unidirectional case with the exception 
of its third condition, which regards messages from q to p. The proof of Theorem 5.2 uses 
the fact that T ≤ 2Cd + d/μ in claims about β'' only to verify bounds on the delay of 
messages from q to p. This analysis is not needed for a theorem about the unidirectional 
case and hence the entire proof specializes to the unidirectional case to give a lower bound 
of C^2 d/μ + Cd + d. 
5.3 Consensus with bounded-capacity message links 

We remark on how our upper bounds for consensus are affected by bounding the capacity of 
the message links used. 



5.3.1 Byzantine failures 

Because it is not message-intensive, our algorithm for Byzantine failures is not affected by 
the restriction of bounded-capacity message links. Recall that the algorithm for Byzantine 
failures does not include a fault-detection task and does not require a process to send a 
message at every step it takes. The correctness follows from the fact that at least time 
2d passes between the time that the round r — 1 messages of all nonfaulty processes are 
delivered and the time that any nonfaulty process advances to round r + f. The round r 
message of a nonfaulty process can incur more than delay d only if it is sent before the 
previous message is delivered. The previous message is its round r — 1 message, so even if 
the round r message incurs added delay, it is still delivered by time 2d (actually, d) after 
the round r — 1 messages of all nonfaulty processes are delivered, and all nonfaulty processes 
receive it before advancing to round r + f. Otherwise, if the round r message does not 
incur added delay due to the capacity of the message link, the proof of Lemma 4.2 holds as 
before. Because nonfaulty processes do not send any messages other than the messages of 
the synchronous algorithm, it is easy to see that the delay of messages does not affect the 
proof of Lemma 4.3, and the running time is not affected. 



5.3.2 Omission failures 

First note that if every pair of messages sent by a process are separated by at least time 
d/μ, then each message is delayed by at most time d and the omissions algorithm is not 
affected (the analysis of Chapter 3 holds). Because the fault-detection protocol requires a 
process to send a message at every step it takes, messages may be separated by as little as 
time c_1; therefore the omissions algorithm is not affected if the message links are of capacity 
μ ≥ d/c_1. 

The first consequence of using links of capacity μ < d/c_1 is the obvious effect on the 
fault-detection protocol. Instead of the bound Cd + d guaranteed for the time until a failure 
is detected (shown in Lemma 3.1), a bound of only min(2Cd + d/μ, C^2 d/μ + Cd + d) can 
be guaranteed by the fault-detection protocol. Lemmas 3.7 and 3.18 then also involve the 
above expression instead of Cd + d. 
But a more serious effect on the running time of the algorithm is the added delay between 
when a process "should" send a message (according to the main algorithm) and when it may 
send it. A crucial element of the algorithm is the piggybacking of messages of the fault-detection 
task and messages of the main algorithm. A straightforward implementation would require 
that messages of the main algorithm can only be sent during steps in which a message of 
the fault-detection task is to be sent. 

If the first timeout task, in which each pair of processes continually sends a "token" back 
and forth, were used for the fault- detection protocol, up to time 2d may elapse between 
when a process is required to send a message of the main algorithm and when it is able to 
piggyback that message onto a message of the fault-detection task. Thus each message may 
in effect be delayed by a total of 3d, since the timeout task ensures that all messages are 
delivered within time d of when they are sent, despite the capacity of the message links. 
This gives bounds of 

$$4(f+1)\,3d + 2Cd \quad \text{for } n \ge 2f+1,$$
$$\left(3\tfrac{f}{n-f} + 5\right)(f+1)\,3d + 2Cd \quad \text{for } n \le 2f.$$

The bound of Section 3.5.2 can be slightly modified to give a bound of 

$$\left(2\sqrt{2C} + 6\right)(f+1)\,3d + 2Cd \quad \text{for } n \le 2f.$$

Using the second timeout task, in which a process waits for $d/(\mu c_1)$ steps between every 
pair of messages, adds a delay of up to $C(d/\mu)$ to every message. Each message is then in 
effect delayed by a total of up to $d(1 + C/\mu)$. This gives bounds of 

$$4(f+1)(1 + C/\mu)d + (C^2 d/\mu + Cd) \quad \text{for } n \ge 2f+1$$

and 

$$\left(3\tfrac{f}{n-f} + 5\right)(f+1)(1 + C/\mu)d + (C^2 d/\mu + Cd)$$
$$\text{and} \quad \left(2C/\sqrt{\mu} + 2\sqrt{C} + 6\right)(f+1)d + (C^2 d/\mu + Cd)$$

for $n \le 2f$. 
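To compare the two timeout tasks numerically, the sketch below evaluates the bounds for the 
case in which fewer than half the processes are faulty. It is only an illustration: the 
function names are hypothetical, the bounds for the case of more faulty processes are 
deliberately left out, and `mu` is assumed to be a dimensionless capacity as in the text. 

```python
def token_task_bound(f, C, d):
    """First timeout task: each message is in effect delayed by 3d."""
    return 4 * (f + 1) * 3 * d + 2 * C * d


def waiting_task_bound(f, C, d, mu):
    """Second timeout task: each message is in effect delayed by d(1 + C/mu)."""
    return 4 * (f + 1) * (1 + C / mu) * d + (C * C * d / mu + C * d)


if __name__ == "__main__":
    f, C, d = 1, 2, 10
    for mu in (1, 2, 4, 8):
        # Print capacity, bound with the token task, bound with the waiting task.
        print(mu, token_task_bound(f, C, d), waiting_task_bound(f, C, d, mu))
```

With $f = 1$, $C = 2$, $d = 10$, the first task gives 280 regardless of capacity, while the second 
gives 300 at capacity 1 and 200 at capacity 2, so which task is preferable depends on the 
capacity of the links. 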

It may be possible that a more clever strategy would allow processes to send messages 
of the algorithm on demand by more closely intertwining the main algorithm and the 
fault-detection task. 

Chapter 6 



Conclusions 



We first summarize the known bounds for consensus: 



Failure type         $n \ge$  Lower bound                Upper bound                           Reference 
-------------------  -------  -------------------------  ------------------------------------  ----------- 
Stopping             $f+1$    $(f-1)d + Cd$, $(f+1)d$    $2(f+1)d + Cd$                        [ADLS90] 
Omissions (sending)  $2f+1$   ↑                          $4(f+1)d + Cd$                        Thm. 3.21 
                     $f+1$    ↑                          $(3\frac{f}{n-f} + 5)(f+1)d + Cd$,    Thm. 3.25 
                                                         $(2\sqrt{C} + 6)(f+1)d + Cd$          Thm. 3.28 
Timing               $f+1$    ↑                          $(f+1)(Cd + d)$                       (see below) 
Byzantine            $3f+1$   ↑                          $(f+1)(d + 2Cd)$                      Thm. 4.4 
Auth. Byzantine      $2f+1$   (fe                        $(f+1)(d + 2Cd)$                      (see below) 
                     $f+1$    ↑                          $(C^{f+1} + C^f + \cdots + C)d$       (see below) 


The bounds for stopping and omission failures (for $n \ge 2f+1$) are tight to within 
approximately a constant factor (2 and 4, respectively). The bounds for omission failures when 
$n \le 2f$ are not tight; an improvement in either direction would be interesting. It has been 
noted by Bharali ([B91]) that the running time for omission failures can be improved to 
$3(f+1)d + Cd$ by the following modification to the algorithm. The improvement is 
obtained by reducing the delay caused by a process that must wait for acknowledgments before 
sending an $r$ message. Recall that if $p_i$ sends an $(r-1)$ message at exactly $t_{r-1}$ (the 
latest possible time) and immediately thereafter receives an $r$ message, it may have to wait 
until time $t_{r-1} + 2d$ to receive enough acknowledgments for its $(r-1)$ message before 
sending an $r$ message and advancing to phase $r+1$. Thus a process $p_j$ receiving the $r$ 
message from $p_i$ would not receive it until $t_{r-1} + 3d$. The idea is to let $p_i$ send a 
"virtual $r$ message", even though it has not yet received $n-f$ acknowledgments for its 
$(r-1)$ message. Process $p_j$ does not treat a virtual $r$ message from $p_i$ as a regular $r$ 
message until it sees that $p_i$ has received enough acknowledgments for its $(r-1)$ message 
(recall that all messages, including acknowledgments, are broadcast to all processes). Thus, 
if $p_i$ does get enough acknowledgments, then both $p_i$ and $p_j$ receive them by time 
$t_{r-1} + 2d$ and $p_j$ has effectively received a (real) $r$ message from $p_i$ by time 
$t_{r-1} + 2d$, saving time $d$. 

⁰ This is an updated version of the original Chapter 6. It differs by the inclusion of the 
following: the upper bound for authenticated Byzantine failures when $n \ge 2f+1$, the 
improvement of the constant from 4 to 3 for the running time of the omissions algorithm 
([B91]), and the more careful analysis of the running time in the model of [HK89]. 
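The saving obtained from the "virtual $r$ message" can be checked with a small worst-case 
timeline. The sketch below is an illustration only: it takes every message to need the full 
delay $d$, measures time from the moment $p_i$ sends its $(r-1)$ message, and uses 
hypothetical function names. 

```python
def regular_arrival(t, d):
    """Latest time p_j receives the r message when p_i waits for acknowledgments."""
    acks_at_sender = t + 2 * d   # (r-1) message out, acknowledgments back
    return acks_at_sender + d    # r message then travels to p_j


def virtual_arrival(t, d):
    """Latest time p_j can treat p_i's virtual r message as a real one."""
    virtual_received = t + d     # virtual r message is sent right away
    acks_seen_by_pj = t + 2 * d  # acknowledgments are broadcast, so p_j sees them too
    return max(virtual_received, acks_seen_by_pj)
```

With $t = 0$ and $d = 5$ the two functions return 15 and 10, a difference of exactly $d$. 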

For failures less benign than omissions, this thesis leaves open a large gap in time 
complexity. In particular, the following central question remains unanswered: 

Does consensus in the presence of Byzantine failures require time $\Omega(fCd)$? 

The difficulty of this problem seems to lie not in the potential for arbitrary message 
content but in the potential for timing misbehavior. We believe an important step towards 
answering this question will be to obtain tight bounds for "timing failures", described below. 

Timing failures 

We say that a process suffers a "timing failure" if the time between some pair of successive 
steps is not in the interval $[c_1, c_2]$. The simple direct rounds simulation first described in 
Chapter 3 tolerates timing failures as well, implying a consensus algorithm with running 
time $(f+1)(Cd + d)$. The algorithm of [ADLS90] is also correct despite timing failures, but 
each of its phases may take up to time $Cd + d$. In fact, no algorithm is known to tolerate 
timing failures in less than time $O(fCd)$. 

Byzantine failures with authentication 

First note that the direct simulations outlined at the beginning of Chapter 3 do not work in 
the presence of Byzantine failures, even with authentication, and our general simulation of 
Chapter 4 itself requires $n \ge 3f+1$. 

The upper bound for authenticated Byzantine failures with $n \ge 2f+1$ can be obtained by 
a very simple modification of the algorithm for Byzantine failures: change "If $|V_r| \ge 2f+1$" 
to "relay $V_{r+1}$" (unconditionally). In other words, to ensure that every process receives 
$f+1$ round $r$ messages (and therefore sends its own round $r$ message), it is sufficient to 
relay the $f+1$ round $r$ messages already received; these messages are signed and therefore 
believable. This protocol works for $n \ge 2f+1$ and achieves the same time complexity. 

When $n \le 2f$, the only obvious algorithm to tolerate authenticated Byzantine failures 
is a costly simulation of a synchronous algorithm. The simulation requires that processes 
begin synchronized and time out each other's timeouts by terminating round $i$ after 
$(C^{i-1} + \cdots + C + 1)d/c_1$ steps. 
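The step counts above can be tabulated with a short sketch. It assumes that the counts are 
measured from the common start, so that a process taking the maximum time $c_2 = C c_1$ per 
step ends round $f+1$ by time $(C^{f+1} + \cdots + C)d$, which matches the upper bound listed 
in the summary of known bounds. Function names are hypothetical, and fractional step counts 
are left unrounded for simplicity. 

```python
def round_timeout_steps(i, C, d, c1):
    """Number of steps after which round i is terminated."""
    return sum(C ** k for k in range(i)) * d / c1


def worst_case_end_time(i, C, d, c1):
    """Real time at which the slowest process (c2 = C * c1 per step) ends round i."""
    c2 = C * c1
    return round_timeout_steps(i, C, d, c1) * c2
```

For $f = 2$, $C = 2$, $d = 10$, $c_1 = 1$, round 3 ends after 70 steps, which is at most time 140, 
equal to $(C^3 + C^2 + C)d$. 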

The lower bound for authenticated Byzantine failures, not presented in this thesis, is 
interesting (greater than $Cd$) only for the limited range of $n \le 2f$, and therefore says 
nothing interesting about unauthenticated Byzantine failures. The proof of this bound is 
similar to the "shifting scenarios" proofs of [FLM86]. 

Before suggesting other directions for further research, we first comment on the 
implications of our bounds for consensus in a closely related model. 

6.1 Consensus in the related model of [HK89] 

Herzberg and Kutten [HK89] consider a model in which the actual worst-case message delay 
in a given execution, $\delta$, may be much less than the a priori worst-case bound on message 
delay, $\Delta$. It is thus desirable for the running time of algorithms to depend minimally on 
$\Delta$. This model raises concerns similar to those of our model; in particular, detecting the 
absence of a message may be much more expensive than receiving the message. 

For the consensus problem, it is not difficult to see that direct implementation of 
synchronous algorithms gives a running time of $O(f\Delta)$ for any type of failures; on the other 
hand, clearly the synchronous lower bound implies that no algorithm can guarantee a running 
time of less than $(f+1)\delta$. In this model, our algorithms yield an improvement over direct 
simulation strategies similar to the corresponding improvement in our semi-synchronous 
model. 

It is not difficult to see that our algorithms may be run without modification in the model 
of [HK89], yielding the same running times with the syntactic substitution of $\Delta$ for $Cd$ and 
$\delta$ for $d$. Thus we obtain bounds of 

$$4(f+1)\delta + \Delta \quad \text{for } n \ge 2f+1,$$
$$\left(3\tfrac{f}{n-f} + 5\right)(f+1)\delta + \Delta \quad \text{for } n \le 2f.$$



The bound involving $\sqrt{C}$ carries over with $\sqrt{\Delta/\delta}$ instead of $\sqrt{C}$, giving a 
bound of $(2\sqrt{\Delta/\delta} + 6)(f+1)\delta + \Delta$. If $\delta \ll \Delta$, these are significant 
improvements over the bounds obtainable by direct simulation. Moreover, it is possible to 
prove better bounds for these algorithms in general, depending on the ratio of $\Delta$ to 
$\delta$; the bound of $4(f+1)d + Cd$ is realized by an execution only if $\Delta = 4\delta$. This is 
because process clocks are perfectly synchronized: whereas in our model the time between 
failure and detection may be anywhere in the range $[d, d + Cd]$, in this model it must be 
$\Delta$ (plus or minus twice the step time, which is assumed to be much less than $\delta$; see 
[ADLS90], §7). The length of each phase except the last must therefore be $\Delta$, and each such 
phase must have at least $\Delta/\delta - 3$ failures (processes in $B_r$). There are thus at 
most $(f+1)/(\Delta/\delta - 3)$ phases except for the last phase, and the running time is at most 
$\left(\frac{\Delta}{\Delta - 3\delta}\right)(f+1)\delta + \Delta$. The running time is the maximum of this 
expression and $4(f+1)\delta + \Delta$. 
Similarly for the stopping failures algorithm of [ADLS90], the running time is the maximum 
of $2(f+1)d + Cd$ and $\left(\frac{\Delta}{\Delta - \delta}\right)(f+1)\delta + \Delta$. Our algorithm for 
Byzantine failures is not interesting in this model, as it is trivial to design an algorithm 
taking only time $(f+1)\Delta$ (our algorithm takes $(2f+1)\Delta + f\delta$). 

In comparison, the algorithms of [DLS88] may also be used in the model of [HK89]. For 
stopping and omission failures (sending and receiving), their algorithms require $n \ge 2f+1$; 
for Byzantine failures, they require $n \ge 3f+1$. Their algorithms assume only that an upper 
bound on message delay time exists (it may not be known to the processes); the running time 
is a function of the maximum message delay in the given execution. The running times are 
$O(\delta^2 + n^2)$ for all types of failures. (As noted in [DLS88], the running times can be 
improved to $O(\delta^2 + f^2)$. We also note that the different model considered there, which 
enables processes to send to at most one process per step, does not affect the time bound 
asymptotically.) 

Note that in contrast to our algorithm and the algorithm of [ADLS90], the running times 
of the algorithms of [DLS88] in the [HK89] model do not depend at all on $\Delta$, the upper bound 
on message delay time. This is possible because the model of [HK89] provides an extra degree 
of power to the algorithm by assuming that process clocks are perfectly synchronized. The 
algorithms of [DLS88] do not give good bounds in our model; the running times depend only 
polynomially on the ratio of process step rates, $C$. This difference in the model also accounts 
for the simplicity of solving consensus in the presence of Byzantine failures in the [HK89] 
model, relative to our semi-synchronous model. 



6.2 Directions for further research 

There are many possible directions for interesting research addressing the issues and concerns 
of real-time behavior of distributed systems: 

• The existence of the underlying synchronous algorithm described in Section 3.2 suggests 
that the results of [ADLS90] and this thesis may be generalizable to certain classes of 
synchronous algorithms. For instance, the properties of the underlying synchronous 
algorithm that make it amenable to "efficient" simulation in our model are that it is 
"early-stopping" and that processes advance to further rounds only because of messages 
received (not because of messages omitted). 

- Can the properties sufficient for efficient simulation be clearly characterized? 

- Can these properties be shown necessary by proving lower bounds with large 
dependency on $C$ for synchronous algorithms lacking these properties? 

- Are the factors of $\frac{f}{n-f}$ and $\sqrt{C}$ inherent to such simulations with $n \le 2f$ 
and tolerating omission failures? 

• What classes of problems are in fact affected by timing uncertainty? Perhaps problems 
solvable in asynchronous systems need not be affected. Can they be helped by timing 
assumptions? Are only fault-tolerant problems affected? 

• Similar questions can be asked in the context of the model of Herzberg and Kutten 
([HK89]): What can be said about converting synchronous algorithms with running 
times as a function of message delay $d$ to algorithms that depend on the actual 
worst-case message delay $\delta$ rather than the a priori worst-case message delay $\Delta$? 

• What can be said about simulating synchronous algorithms that do not operate in 
rounds? 

• Other work ([SDC90]) on the real-time complexity of the consensus problem assumes 
a different model of semi-synchrony. There, continuous local clocks are assumed to 
be within a fixed constant $\epsilon$ of each other and to stay within a linear envelope of real 
time. Insight into how these two models are related would enable a comparison of the 
bounds that have been obtained. In particular, using the assumptions of our model, 
for what values of their parameters is their model implementable? 

• We have given a straightforward implementation of our consensus algorithm using 
bounded capacity message links. Can a more involved approach avoid merely effectively 
increasing the delay of each message? 

• For other problems, can bounded capacity message links be used to implicitly control 
message complexity by causing message inefficiency to be manifested as time 
inefficiency? 